* [PATCH 0/8] io-pgtable lock removal
@ 2017-06-08 11:51 ` Robin Murphy
  0 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:51 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

Hi all,

Here's the cleaned-up, nominally-final version of the patches everybody's
keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
#2-#4 do some preparatory work (and bid farewell to everyone's least
favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.

The branch I've previously shared has been updated too:

  git://linux-arm.org/linux-rm  iommu/pgtable

All feedback welcome, as I'd really like to land this for 4.13.

Robin.


Robin Murphy (8):
  iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
  iommu/io-pgtable-arm: Improve split_blk_unmap
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/arm-smmu-v3: Remove io-pgtable spinlock

 drivers/iommu/arm-smmu-v3.c        |  36 ++-----
 drivers/iommu/arm-smmu.c           |  48 ++++------
 drivers/iommu/io-pgtable-arm-v7s.c | 173 +++++++++++++++++++++------------
 drivers/iommu/io-pgtable-arm.c     | 190 ++++++++++++++++++++++++-------------
 drivers/iommu/io-pgtable.h         |   6 ++
 5 files changed, 268 insertions(+), 185 deletions(-)

-- 
2.12.2.dirty

* [PATCH 1/8] iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

Whilst we don't support the PXN bit at all, and so should never encounter a
level 1 section or supersection PTE with it set, it would still be wise
to check both table type bits to resolve any theoretical ambiguity.
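
For illustration, a minimal user-space sketch of why the exact match
matters (assuming the usual level-1 type encodings: 0b01 = table,
0b10 = section, 0b11 = section with PXN; all names here are
illustrative rather than the driver's own):

  #include <assert.h>
  #include <stdint.h>

  #define PTE_TYPE_TABLE 0x1

  int main(void)
  {
      /* hypothetical level-1 PTE with type bits 0b11 (PXN section) */
      uint32_t pxn_section = 0x3;

      /* old check: testing bit 0 alone misreads it as a table */
      assert(pxn_section & PTE_TYPE_TABLE);

      /* new check: comparing both type bits removes the ambiguity */
      assert((pxn_section & 0x3) != PTE_TYPE_TABLE);
      return 0;
  }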

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 8d6ca28c3e1f..a490db032c51 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -92,7 +92,8 @@
 #define ARM_V7S_PTE_TYPE_CONT_PAGE	0x1
 
 #define ARM_V7S_PTE_IS_VALID(pte)	(((pte) & 0x3) != 0)
-#define ARM_V7S_PTE_IS_TABLE(pte, lvl)	(lvl == 1 && ((pte) & ARM_V7S_PTE_TYPE_TABLE))
+#define ARM_V7S_PTE_IS_TABLE(pte, lvl) \
+	((lvl) == 1 && (((pte) & 0x3) == ARM_V7S_PTE_TYPE_TABLE))
 
 /* Page table bits */
 #define ARM_V7S_ATTR_XN(lvl)		BIT(4 * (2 - (lvl)))
-- 
2.12.2.dirty

* [PATCH 2/8] iommu/io-pgtable-arm: Improve split_blk_unmap
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

The current split_blk_unmap implementation suffers from some inscrutable
pointer trickery for creating the tables to replace the block entry, but
more than that it also suffers from hideous inefficiency. For example,
the most pathological case of unmapping a level 3 page from a level 1
block will allocate 513 lower-level tables to remap the entire block at
page granularity, when only 2 are actually needed (the rest can be
covered by level 2 block entries).

Also, we would like to be able to relax the spinlock requirement in
future, for which the roll-back-and-try-again logic for race resolution
would be pretty hideous under the current paradigm.

Both issues can be resolved most neatly by turning things sideways:
instead of repeatedly recursing into __arm_lpae_map() to build up an
entire new sub-table depth-first, we can directly replace the block
entry with a next-level table of block/page entries, then repeat by
unmapping at the next level if necessary. With a little refactoring of
some helper functions, the code ends up not much bigger than before, but
considerably easier to follow and to adapt in future.
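
To put numbers on the example above, a back-of-the-envelope sketch
(assuming a 4KB granule, i.e. 1GB level-1 blocks and 2MB level-2
blocks; plain user-space C, not driver code):

  #include <stdio.h>

  int main(void)
  {
      unsigned long l1_block = 1UL << 30; /* 1GB level-1 block */
      unsigned long l2_block = 1UL << 21; /* 2MB level-2 block */

      /*
       * Old approach: remap the whole 1GB at page granularity,
       * i.e. one level-2 table plus one level-3 table per 2MB.
       */
      unsigned long old_tables = 1 + l1_block / l2_block; /* 513 */

      /*
       * New approach: one level-2 table (511 block entries plus one
       * table entry) and a single level-3 table for the 2MB region
       * containing the unmapped page.
       */
      unsigned long new_tables = 2;

      printf("old: %lu tables, new: %lu\n", old_tables, new_tables);
      return 0;
  }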

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/io-pgtable-arm.c | 116 ++++++++++++++++++++++++-----------------
 1 file changed, 68 insertions(+), 48 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6e5df5e0a3bd..97d039952367 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -264,13 +264,32 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			    unsigned long iova, size_t size, int lvl,
 			    arm_lpae_iopte *ptep);
 
+static arm_lpae_iopte __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+					  arm_lpae_iopte prot, int lvl,
+					  phys_addr_t paddr)
+{
+	arm_lpae_iopte pte = prot;
+
+	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		pte |= ARM_LPAE_PTE_NS;
+
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		pte |= ARM_LPAE_PTE_TYPE_PAGE;
+	else
+		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
+	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
+
+	return pte;
+}
+
 static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			     unsigned long iova, phys_addr_t paddr,
 			     arm_lpae_iopte prot, int lvl,
 			     arm_lpae_iopte *ptep)
 {
-	arm_lpae_iopte pte = prot;
-	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte pte;
 
 	if (iopte_leaf(*ptep, lvl)) {
 		/* We require an unmap first */
@@ -289,21 +308,25 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			return -EINVAL;
 	}
 
-	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-		pte |= ARM_LPAE_PTE_NS;
-
-	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
-		pte |= ARM_LPAE_PTE_TYPE_PAGE;
-	else
-		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
-
-	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
-	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
-
-	__arm_lpae_set_pte(ptep, pte, cfg);
+	pte = __arm_lpae_init_pte(data, prot, lvl, paddr);
+	__arm_lpae_set_pte(ptep, pte, &data->iop.cfg);
 	return 0;
 }
 
+static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
+					     arm_lpae_iopte *ptep,
+					     struct io_pgtable_cfg *cfg)
+{
+	arm_lpae_iopte new;
+
+	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		new |= ARM_LPAE_PTE_NSTABLE;
+
+	__arm_lpae_set_pte(ptep, new, cfg);
+	return new;
+}
+
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
 			  int lvl, arm_lpae_iopte *ptep)
@@ -331,10 +354,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		if (!cptep)
 			return -ENOMEM;
 
-		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
-		if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-			pte |= ARM_LPAE_PTE_NSTABLE;
-		__arm_lpae_set_pte(ptep, pte, cfg);
+		arm_lpae_install_table(cptep, ptep, cfg);
 	} else if (!iopte_leaf(pte, lvl)) {
 		cptep = iopte_deref(pte, data);
 	} else {
@@ -452,40 +472,42 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
 
 static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 				    unsigned long iova, size_t size,
-				    arm_lpae_iopte prot, int lvl,
-				    arm_lpae_iopte *ptep, size_t blk_size)
+				    arm_lpae_iopte blk_pte, int lvl,
+				    arm_lpae_iopte *ptep)
 {
-	unsigned long blk_start, blk_end;
-	phys_addr_t blk_paddr;
-	arm_lpae_iopte table = 0;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte pte, *tablep;
+	size_t tablesz = ARM_LPAE_GRANULE(data);
+	size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	int i, unmap_idx = -1;
 
-	blk_start = iova & ~(blk_size - 1);
-	blk_end = blk_start + blk_size;
-	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return 0;
 
-	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
-		arm_lpae_iopte *tablep;
+	tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+	if (!tablep)
+		return 0; /* Bytes unmapped */
 
+	if (size == split_sz)
+		unmap_idx = ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+	pte = __arm_lpae_init_pte(data, iopte_prot(blk_pte), lvl,
+				iopte_to_pfn(blk_pte, data) << data->pg_shift);
+
+	for (i = 0; i < tablesz / sizeof(pte); pte += split_sz, i++) {
 		/* Unmap! */
-		if (blk_start == iova)
+		if (i == unmap_idx)
 			continue;
 
-		/* __arm_lpae_map expects a pointer to the start of the table */
-		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
-		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
-				   tablep) < 0) {
-			if (table) {
-				/* Free the table we allocated */
-				tablep = iopte_deref(table, data);
-				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
-			}
-			return 0; /* Bytes unmapped */
-		}
+		__arm_lpae_set_pte(&tablep[i], pte, cfg);
 	}
 
-	__arm_lpae_set_pte(ptep, table, &data->iop.cfg);
-	iova &= ~(blk_size - 1);
-	io_pgtable_tlb_add_flush(&data->iop, iova, blk_size, blk_size, true);
+	arm_lpae_install_table(tablep, ptep, cfg);
+
+	if (unmap_idx < 0)
+		return __arm_lpae_unmap(data, iova, size, lvl, tablep);
+
+	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
 }
 
@@ -495,7 +517,6 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 {
 	arm_lpae_iopte pte;
 	struct io_pgtable *iop = &data->iop;
-	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
 	/* Something went horribly wrong and we ran out of page table */
 	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -507,7 +528,7 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		return 0;
 
 	/* If the size matches this level, we're in the right place */
-	if (size == blk_size) {
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
 		__arm_lpae_set_pte(ptep, 0, &iop->cfg);
 
 		if (!iopte_leaf(pte, lvl)) {
@@ -527,9 +548,8 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		 * Insert a table at the next level to map the old region,
 		 * minus the part we want to unmap
 		 */
-		return arm_lpae_split_blk_unmap(data, iova, size,
-						iopte_prot(pte), lvl, ptep,
-						blk_size);
+		return arm_lpae_split_blk_unmap(data, iova, size, pte,
+						lvl + 1, ptep);
 	}
 
 	/* Keep on walkin' */
-- 
2.12.2.dirty

* [PATCH 3/8] iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

Whilst the short-descriptor format's split_blk_unmap implementation has
no need to be recursive, it followed the pattern of the LPAE version
anyway for the sake of consistency. With the latter now reworked for
both efficiency and future scalability improvements, tweak the former
similarly, not least to make it less obtuse.
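
As a rough sketch of the arithmetic this boils down to (assuming the
standard short-descriptor geometry: 256-entry level-2 tables, 4KB
small pages, 64KB large pages as 16 contiguous entries; the address
is made up):

  #include <stdio.h>

  int main(void)
  {
      unsigned long iova = 0x40050000; /* hypothetical IOVA */
      unsigned long size = 0x10000;    /* one 64KB large page */

      int num_ptes = 256;                  /* level-2 entries per table */
      int num_entries = size >> 12;        /* 16 contiguous PTEs */
      int unmap_idx = (iova >> 12) & 0xff; /* index left unmapped */

      printf("fill %d PTEs in runs of %d, skipping index %d\n",
             num_ptes, num_entries, unmap_idx);
      return 0;
  }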

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 85 ++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index a490db032c51..3ee8e61eeb18 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -281,6 +281,13 @@ static arm_v7s_iopte arm_v7s_prot_to_pte(int prot, int lvl,
 	else if (prot & IOMMU_CACHE)
 		pte |= ARM_V7S_ATTR_B | ARM_V7S_ATTR_C;
 
+	pte |= ARM_V7S_PTE_TYPE_PAGE;
+	if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
+		pte |= ARM_V7S_ATTR_NS_SECTION;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
+		pte |= ARM_V7S_ATTR_MTK_4GB;
+
 	return pte;
 }
 
@@ -353,7 +360,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 			    int lvl, int num_entries, arm_v7s_iopte *ptep)
 {
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
-	arm_v7s_iopte pte = arm_v7s_prot_to_pte(prot, lvl, cfg);
+	arm_v7s_iopte pte;
 	int i;
 
 	for (i = 0; i < num_entries; i++)
@@ -375,13 +382,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 			return -EEXIST;
 		}
 
-	pte |= ARM_V7S_PTE_TYPE_PAGE;
-	if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
-		pte |= ARM_V7S_ATTR_NS_SECTION;
-
-	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
-		pte |= ARM_V7S_ATTR_MTK_4GB;
-
+	pte = arm_v7s_prot_to_pte(prot, lvl, cfg);
 	if (num_entries > 1)
 		pte = arm_v7s_pte_to_cont(pte, lvl);
 
@@ -391,6 +392,20 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 	return 0;
 }
 
+static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte *table,
+					   arm_v7s_iopte *ptep,
+					   struct io_pgtable_cfg *cfg)
+{
+	arm_v7s_iopte new;
+
+	new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		new |= ARM_V7S_ATTR_NS_TABLE;
+
+	__arm_v7s_set_pte(ptep, new, 1, cfg);
+	return new;
+}
+
 static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 			 phys_addr_t paddr, size_t size, int prot,
 			 int lvl, arm_v7s_iopte *ptep)
@@ -418,11 +433,7 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 		if (!cptep)
 			return -ENOMEM;
 
-		pte = virt_to_phys(cptep) | ARM_V7S_PTE_TYPE_TABLE;
-		if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-			pte |= ARM_V7S_ATTR_NS_TABLE;
-
-		__arm_v7s_set_pte(ptep, pte, 1, cfg);
+		arm_v7s_install_table(cptep, ptep, cfg);
 	} else if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
 		cptep = iopte_deref(pte, lvl);
 	} else {
@@ -503,41 +514,35 @@ static void arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
 
 static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
 				   unsigned long iova, size_t size,
-				   arm_v7s_iopte *ptep)
+				   arm_v7s_iopte blk_pte, arm_v7s_iopte *ptep)
 {
-	unsigned long blk_start, blk_end, blk_size;
-	phys_addr_t blk_paddr;
-	arm_v7s_iopte table = 0;
-	int prot = arm_v7s_pte_to_prot(*ptep, 1);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_v7s_iopte pte, *tablep;
+	int i, unmap_idx, num_entries, num_ptes;
 
-	blk_size = ARM_V7S_BLOCK_SIZE(1);
-	blk_start = iova & ARM_V7S_LVL_MASK(1);
-	blk_end = blk_start + ARM_V7S_BLOCK_SIZE(1);
-	blk_paddr = *ptep & ARM_V7S_LVL_MASK(1);
+	tablep = __arm_v7s_alloc_table(2, GFP_ATOMIC, data);
+	if (!tablep)
+		return 0; /* Bytes unmapped */
 
-	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
-		arm_v7s_iopte *tablep;
+	num_ptes = ARM_V7S_PTES_PER_LVL(2);
+	num_entries = size >> ARM_V7S_LVL_SHIFT(2);
+	unmap_idx = ARM_V7S_LVL_IDX(iova, 2);
 
+	pte = arm_v7s_prot_to_pte(arm_v7s_pte_to_prot(blk_pte, 1), 2, cfg);
+	if (num_entries > 1)
+		pte = arm_v7s_pte_to_cont(pte, 2);
+
+	for (i = 0; i < num_ptes; pte += size, i += num_entries) {
 		/* Unmap! */
-		if (blk_start == iova)
+		if (i == unmap_idx)
 			continue;
 
-		/* __arm_v7s_map expects a pointer to the start of the table */
-		tablep = &table - ARM_V7S_LVL_IDX(blk_start, 1);
-		if (__arm_v7s_map(data, blk_start, blk_paddr, size, prot, 1,
-				  tablep) < 0) {
-			if (table) {
-				/* Free the table we allocated */
-				tablep = iopte_deref(table, 1);
-				__arm_v7s_free_table(tablep, 2, data);
-			}
-			return 0; /* Bytes unmapped */
-		}
+		__arm_v7s_set_pte(&tablep[i], pte, num_entries, cfg);
 	}
 
-	__arm_v7s_set_pte(ptep, table, 1, &data->iop.cfg);
-	iova &= ~(blk_size - 1);
-	io_pgtable_tlb_add_flush(&data->iop, iova, blk_size, blk_size, true);
+	arm_v7s_install_table(tablep, ptep, cfg);
+
+	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
 }
 
@@ -594,7 +599,7 @@ static int __arm_v7s_unmap(struct arm_v7s_io_pgtable *data,
 		 * Insert a table at the next level to map the old region,
 		 * minus the part we want to unmap
 		 */
-		return arm_v7s_split_blk_unmap(data, iova, size, ptep);
+		return arm_v7s_split_blk_unmap(data, iova, size, pte[0], ptep);
 	}
 
 	/* Keep on walkin' */
-- 
2.12.2.dirty

* [PATCH 4/8] iommu/io-pgtable: Introduce explicit coherency
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

Once we remove the serialising spinlock, a potential race opens up for
non-coherent IOMMUs whereby a caller of .map() can be sure that cache
maintenance has been performed on their new PTE, but will have no
guarantee that such maintenance for table entries above it has actually
completed (e.g. if another CPU took an interrupt immediately after
writing the table entry, but before initiating the DMA sync).

Handling this race safely will add some potentially non-trivial overhead
to installing a table entry, which we would much rather avoid on
coherent systems where it will be unnecessary, and where we are striving
to minimise latency by removing the locking in the first place.

To that end, let's introduce an explicit notion of cache-coherency to
io-pgtable, such that we will be able to avoid penalising IOMMUs which
know enough to know when they are coherent.
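
As a minimal user-space model of the resulting sync decision (the
quirk value matches the patch below; everything else is illustrative):

  #include <stdio.h>

  #define IO_PGTABLE_QUIRK_NO_DMA (1UL << 4)

  /* a coherent page-table walker can skip cache maintenance */
  static void pte_sync(unsigned long quirks)
  {
      if (quirks & IO_PGTABLE_QUIRK_NO_DMA)
          return;
      printf("dma_sync_single_for_device()\n");
  }

  int main(void)
  {
      pte_sync(0);                       /* non-coherent: syncs */
      pte_sync(IO_PGTABLE_QUIRK_NO_DMA); /* coherent: no-op */
      return 0;
  }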

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu-v3.c        |  3 +++
 drivers/iommu/arm-smmu.c           |  3 +++
 drivers/iommu/io-pgtable-arm-v7s.c | 17 ++++++++++-------
 drivers/iommu/io-pgtable-arm.c     | 11 ++++++-----
 drivers/iommu/io-pgtable.h         |  6 ++++++
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969aa60d5..b9c4cf4ccca2 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1555,6 +1555,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
+		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
+
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops)
 		return -ENOMEM;
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7ec30b08b3bd..1f42f339a284 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1010,6 +1010,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
+		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
+
 	smmu_domain->smmu = smmu;
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops) {
diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 3ee8e61eeb18..2437f2899661 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -187,7 +187,8 @@ static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl)
 static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 				   struct arm_v7s_io_pgtable *data)
 {
-	struct device *dev = data->iop.cfg.iommu_dev;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	struct device *dev = cfg->iommu_dev;
 	dma_addr_t dma;
 	size_t size = ARM_V7S_TABLE_SIZE(lvl);
 	void *table = NULL;
@@ -196,7 +197,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 		table = (void *)__get_dma_pages(__GFP_ZERO, get_order(size));
 	else if (lvl == 2)
 		table = kmem_cache_zalloc(data->l2_tables, gfp | GFP_DMA);
-	if (table && !selftest_running) {
+	if (table && !(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)) {
 		dma = dma_map_single(dev, table, size, DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, dma))
 			goto out_free;
@@ -225,10 +226,11 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 static void __arm_v7s_free_table(void *table, int lvl,
 				 struct arm_v7s_io_pgtable *data)
 {
-	struct device *dev = data->iop.cfg.iommu_dev;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	struct device *dev = cfg->iommu_dev;
 	size_t size = ARM_V7S_TABLE_SIZE(lvl);
 
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_unmap_single(dev, __arm_v7s_dma_addr(table), size,
 				 DMA_TO_DEVICE);
 	if (lvl == 1)
@@ -240,7 +242,7 @@ static void __arm_v7s_free_table(void *table, int lvl,
 static void __arm_v7s_pte_sync(arm_v7s_iopte *ptep, int num_entries,
 			       struct io_pgtable_cfg *cfg)
 {
-	if (selftest_running)
+	if (cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)
 		return;
 
 	dma_sync_single_for_device(cfg->iommu_dev, __arm_v7s_dma_addr(ptep),
@@ -657,7 +659,8 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,
 	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
 			    IO_PGTABLE_QUIRK_NO_PERMS |
 			    IO_PGTABLE_QUIRK_TLBI_ON_MAP |
-			    IO_PGTABLE_QUIRK_ARM_MTK_4GB))
+			    IO_PGTABLE_QUIRK_ARM_MTK_4GB |
+			    IO_PGTABLE_QUIRK_NO_DMA))
 		return NULL;
 
 	/* If ARM_MTK_4GB is enabled, the NO_PERMS is also expected. */
@@ -774,7 +777,7 @@ static int __init arm_v7s_do_selftests(void)
 		.tlb = &dummy_tlb_ops,
 		.oas = 32,
 		.ias = 32,
-		.quirks = IO_PGTABLE_QUIRK_ARM_NS,
+		.quirks = IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA,
 		.pgsize_bitmap = SZ_4K | SZ_64K | SZ_1M | SZ_16M,
 	};
 	unsigned int iova, size, iova_start;
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 97d039952367..bdc954cb98c1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -217,7 +217,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
 	if (!pages)
 		return NULL;
 
-	if (!selftest_running) {
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)) {
 		dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, dma))
 			goto out_free;
@@ -243,7 +243,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
 static void __arm_lpae_free_pages(void *pages, size_t size,
 				  struct io_pgtable_cfg *cfg)
 {
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
 				 size, DMA_TO_DEVICE);
 	free_pages_exact(pages, size);
@@ -254,7 +254,7 @@ static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
 {
 	*ptep = pte;
 
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_sync_single_for_device(cfg->iommu_dev,
 					   __arm_lpae_dma_addr(ptep),
 					   sizeof(pte), DMA_TO_DEVICE);
@@ -693,7 +693,7 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 	u64 reg;
 	struct arm_lpae_io_pgtable *data;
 
-	if (cfg->quirks & ~IO_PGTABLE_QUIRK_ARM_NS)
+	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
@@ -782,7 +782,7 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
 	struct arm_lpae_io_pgtable *data;
 
 	/* The NS quirk doesn't apply at stage 2 */
-	if (cfg->quirks)
+	if (cfg->quirks & ~IO_PGTABLE_QUIRK_NO_DMA)
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
@@ -1086,6 +1086,7 @@ static int __init arm_lpae_do_selftests(void)
 	struct io_pgtable_cfg cfg = {
 		.tlb = &dummy_tlb_ops,
 		.oas = 48,
+		.quirks = IO_PGTABLE_QUIRK_NO_DMA,
 	};
 
 	for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 969d82cc92ca..524263a7ae6f 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -65,11 +65,17 @@ struct io_pgtable_cfg {
 	 *	PTEs, for Mediatek IOMMUs which treat it as a 33rd address bit
 	 *	when the SoC is in "4GB mode" and they can only access the high
 	 *	remap of DRAM (0x1_00000000 to 0x1_ffffffff).
+	 *
+	 * IO_PGTABLE_QUIRK_NO_DMA: Guarantees that the tables will only ever
+	 *	be accessed by a fully cache-coherent IOMMU or CPU (e.g. for a
+	 *	software-emulated IOMMU), such that pagetable updates need not
+	 *	be treated as explicit DMA data.
 	 */
 	#define IO_PGTABLE_QUIRK_ARM_NS		BIT(0)
 	#define IO_PGTABLE_QUIRK_NO_PERMS	BIT(1)
 	#define IO_PGTABLE_QUIRK_TLBI_ON_MAP	BIT(2)
 	#define IO_PGTABLE_QUIRK_ARM_MTK_4GB	BIT(3)
+	#define IO_PGTABLE_QUIRK_NO_DMA		BIT(4)
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.12.2.dirty

* [PATCH 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

For parallel I/O with multiple concurrent threads servicing the same
device (or devices, if several share a domain), serialising page table
updates becomes a massive bottleneck. On reflection, though, we don't
strictly need to do that - for valid IOMMU API usage, there are in fact
only two races that we need to guard against: multiple map requests for
different blocks within the same region, when the intermediate-level
table for that region does not yet exist; and multiple unmaps of
different parts of the same block entry. Both of those are fairly easily
solved by using a cmpxchg to install the new table, such that if we then
find that someone else's table got there first, we can simply free ours
and continue.

Make the requisite changes such that we can withstand being called
without the caller maintaining a lock. In theory, this opens up a few
corners in which wildly misbehaving callers making nonsensical
overlapping requests might lead to crashes instead of just unpredictable
results, but correct code really does not deserve to pay a significant
performance cost for the sake of masking bugs in theoretically broken code.
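
As a stripped-down user-space model of that cmpxchg pattern (the real
code below additionally handles the non-coherent sync flag; names here
are illustrative):

  #include <stdint.h>
  #include <stdio.h>

  /* Try to publish our table; return what was actually in the PTE. */
  static uint64_t install_table(uint64_t *ptep, uint64_t curr,
                                uint64_t new)
  {
      uint64_t old = curr;

      __atomic_compare_exchange_n(ptep, &old, new, 0,
                                  __ATOMIC_RELAXED, __ATOMIC_RELAXED);
      return old; /* == curr iff our table was installed */
  }

  int main(void)
  {
      uint64_t pte = 0;
      uint64_t mine = 0x1000 | 0x3; /* made-up table descriptor */

      if (install_table(&pte, 0, mine) == 0)
          printf("installed our table\n");
      else
          printf("lost the race: free ours, reuse the winner's\n");
      return 0;
  }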

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/io-pgtable-arm.c | 75 ++++++++++++++++++++++++++++++++----------
 1 file changed, 58 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index bdc954cb98c1..d857961af1d3 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -20,6 +20,7 @@
 
 #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
 
+#include <linux/atomic.h>
 #include <linux/iommu.h>
 #include <linux/kernel.h>
 #include <linux/sizes.h>
@@ -99,6 +100,8 @@
 #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
+/* Software bit for solving coherency races */
+#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
 
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
@@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
 	free_pages_exact(pages, size);
 }
 
+static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
+				struct io_pgtable_cfg *cfg)
+{
+	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
+				   sizeof(*ptep), DMA_TO_DEVICE);
+}
+
 static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
 			       struct io_pgtable_cfg *cfg)
 {
 	*ptep = pte;
 
 	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
-		dma_sync_single_for_device(cfg->iommu_dev,
-					   __arm_lpae_dma_addr(ptep),
-					   sizeof(pte), DMA_TO_DEVICE);
+		__arm_lpae_sync_pte(ptep, cfg);
 }
 
 static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -289,13 +297,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			     arm_lpae_iopte prot, int lvl,
 			     arm_lpae_iopte *ptep)
 {
-	arm_lpae_iopte pte;
+	arm_lpae_iopte pte = *ptep;
 
-	if (iopte_leaf(*ptep, lvl)) {
+	if (iopte_leaf(pte, lvl)) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
-	} else if (iopte_type(*ptep, lvl) == ARM_LPAE_PTE_TYPE_TABLE) {
+	} else if (iopte_type(pte, lvl) == ARM_LPAE_PTE_TYPE_TABLE) {
 		/*
 		 * We need to unmap and free the old table before
 		 * overwriting it with a block entry.
@@ -315,16 +323,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 
 static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 					     arm_lpae_iopte *ptep,
+					     arm_lpae_iopte curr,
 					     struct io_pgtable_cfg *cfg)
 {
-	arm_lpae_iopte new;
+	arm_lpae_iopte old, new;
 
 	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
 		new |= ARM_LPAE_PTE_NSTABLE;
 
-	__arm_lpae_set_pte(ptep, new, cfg);
-	return new;
+	old = cmpxchg64_relaxed(ptep, curr, new);
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)
+		return old;
+
+	/* Even if it's not ours, there's no point waiting; just kick it */
+	if (!(old & ARM_LPAE_PTE_SW_SYNC))
+		__arm_lpae_sync_pte(ptep, cfg);
+	if (old == curr) {
+		/* Ensure our sync is finished before we mark it as such */
+		wmb();
+		*ptep |= ARM_LPAE_PTE_SW_SYNC;
+	}
+
+	return old;
 }
 
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
@@ -333,6 +355,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 {
 	arm_lpae_iopte *cptep, pte;
 	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	size_t tblsz = ARM_LPAE_GRANULE(data);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 
 	/* Find our entry at the current level */
@@ -347,20 +370,26 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		return -EINVAL;
 
 	/* Grab a pointer to the next level */
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (!pte) {
-		cptep = __arm_lpae_alloc_pages(ARM_LPAE_GRANULE(data),
-					       GFP_ATOMIC, cfg);
+		cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
 		if (!cptep)
 			return -ENOMEM;
 
-		arm_lpae_install_table(cptep, ptep, cfg);
-	} else if (!iopte_leaf(pte, lvl)) {
-		cptep = iopte_deref(pte, data);
-	} else {
+		pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
+		if (pte)
+			__arm_lpae_free_pages(cptep, tblsz, cfg);
+	} else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
+		   !(pte & ARM_LPAE_PTE_SW_SYNC)) {
+		__arm_lpae_sync_pte(ptep, cfg);
+	}
+
+	if (iopte_leaf(pte, lvl)) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
+	} else if (pte) {
+		cptep = iopte_deref(pte, data);
 	}
 
 	/* Rinse, repeat */
@@ -502,7 +531,19 @@ static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 		__arm_lpae_set_pte(&tablep[i], pte, cfg);
 	}
 
-	arm_lpae_install_table(tablep, ptep, cfg);
+	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+	if (pte != blk_pte) {
+		__arm_lpae_free_pages(tablep, tablesz, cfg);
+		/*
+		 * We may race against someone unmapping another part of this
+		 * block, but anything else is invalid. We can't misinterpret
+		 * a page entry here since we're never at the last level.
+		 */
+		if (iopte_type(pte, lvl) != ARM_LPAE_PTE_TYPE_TABLE)
+			return 0;
+
+		tablep = iopte_deref(pte, data);
+	}
 
 	if (unmap_idx < 0)
 		return __arm_lpae_unmap(data, iova, size, lvl, tablep);
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/8] iommu/io-pgtable-arm-v7s: Support lockless operation
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

Mirroring the LPAE implementation, rework the v7s code to be robust
against concurrent operations. The same two potential races exist, and
are solved in the same manner, with the fixed 2-level structure making
life ever so slightly simpler.

What complicates matters compared to LPAE, however, is large page
entries, since we can't update a block of 16 PTEs atomically, nor assume
available software bits to do clever things with. As most users are
never likely to do partial unmaps anyway (due to DMA API rules), it
doesn't seem unreasonable for this case to remain behind a serialising
lock; we just pull said lock down into the bowels of the implementation
so it's well out of the way of the normal call paths.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 78 ++++++++++++++++++++++++++++----------
 1 file changed, 58 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 2437f2899661..35d0f41af276 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -32,6 +32,7 @@
 
 #define pr_fmt(fmt)	"arm-v7s io-pgtable: " fmt
 
+#include <linux/atomic.h>
 #include <linux/dma-mapping.h>
 #include <linux/gfp.h>
 #include <linux/iommu.h>
@@ -39,6 +40,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/spinlock.h>
 #include <linux/types.h>
 
 #include <asm/barrier.h>
@@ -168,6 +170,7 @@ struct arm_v7s_io_pgtable {
 
 	arm_v7s_iopte		*pgd;
 	struct kmem_cache	*l2_tables;
+	spinlock_t		split_lock;
 };
 
 static dma_addr_t __arm_v7s_dma_addr(void *pages)
@@ -396,16 +399,19 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 
 static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte *table,
 					   arm_v7s_iopte *ptep,
+					   arm_v7s_iopte curr,
 					   struct io_pgtable_cfg *cfg)
 {
-	arm_v7s_iopte new;
+	arm_v7s_iopte old, new;
 
 	new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;
 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
 		new |= ARM_V7S_ATTR_NS_TABLE;
 
-	__arm_v7s_set_pte(ptep, new, 1, cfg);
-	return new;
+	old = cmpxchg_relaxed(ptep, curr, new);
+	__arm_v7s_pte_sync(ptep, 1, cfg);
+
+	return old;
 }
 
 static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
@@ -429,16 +435,23 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 		return -EINVAL;
 
 	/* Grab a pointer to the next level */
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (!pte) {
 		cptep = __arm_v7s_alloc_table(lvl + 1, GFP_ATOMIC, data);
 		if (!cptep)
 			return -ENOMEM;
 
-		arm_v7s_install_table(cptep, ptep, cfg);
-	} else if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
-		cptep = iopte_deref(pte, lvl);
+		pte = arm_v7s_install_table(cptep, ptep, 0, cfg);
+		if (pte)
+			__arm_v7s_free_table(cptep, lvl + 1, data);
 	} else {
+		/* We've no easy way of knowing if it's synced yet, so... */
+		__arm_v7s_pte_sync(ptep, 1, cfg);
+	}
+
+	if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
+		cptep = iopte_deref(pte, lvl);
+	} else if (pte) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
@@ -491,27 +504,31 @@ static void arm_v7s_free_pgtable(struct io_pgtable *iop)
 	kfree(data);
 }
 
-static void arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
-			       unsigned long iova, int idx, int lvl,
-			       arm_v7s_iopte *ptep)
+static arm_v7s_iopte arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
+					unsigned long iova, int idx, int lvl,
+					arm_v7s_iopte *ptep)
 {
 	struct io_pgtable *iop = &data->iop;
 	arm_v7s_iopte pte;
 	size_t size = ARM_V7S_BLOCK_SIZE(lvl);
 	int i;
 
+	/* Check that we didn't lose a race to get the lock */
+	pte = *ptep;
+	if (!arm_v7s_pte_is_cont(pte, lvl))
+		return pte;
+
 	ptep -= idx & (ARM_V7S_CONT_PAGES - 1);
-	pte = arm_v7s_cont_to_pte(*ptep, lvl);
-	for (i = 0; i < ARM_V7S_CONT_PAGES; i++) {
-		ptep[i] = pte;
-		pte += size;
-	}
+	pte = arm_v7s_cont_to_pte(pte, lvl);
+	for (i = 0; i < ARM_V7S_CONT_PAGES; i++)
+		ptep[i] = pte + i * size;
 
 	__arm_v7s_pte_sync(ptep, ARM_V7S_CONT_PAGES, &iop->cfg);
 
 	size *= ARM_V7S_CONT_PAGES;
 	io_pgtable_tlb_add_flush(iop, iova, size, size, true);
 	io_pgtable_tlb_sync(iop);
+	return pte;
 }
 
 static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
@@ -542,7 +559,16 @@ static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
 		__arm_v7s_set_pte(&tablep[i], pte, num_entries, cfg);
 	}
 
-	arm_v7s_install_table(tablep, ptep, cfg);
+	pte = arm_v7s_install_table(tablep, ptep, blk_pte, cfg);
+	if (pte != blk_pte) {
+		__arm_v7s_free_table(tablep, 2, data);
+
+		if (!ARM_V7S_PTE_IS_TABLE(pte, 1))
+			return 0;
+
+		tablep = iopte_deref(pte, 1);
+		return __arm_v7s_unmap(data, iova, size, 2, tablep);
+	}
 
 	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
@@ -563,17 +589,28 @@ static int __arm_v7s_unmap(struct arm_v7s_io_pgtable *data,
 	idx = ARM_V7S_LVL_IDX(iova, lvl);
 	ptep += idx;
 	do {
-		if (WARN_ON(!ARM_V7S_PTE_IS_VALID(ptep[i])))
+		pte[i] = READ_ONCE(ptep[i]);
+		if (WARN_ON(!ARM_V7S_PTE_IS_VALID(pte[i])))
 			return 0;
-		pte[i] = ptep[i];
 	} while (++i < num_entries);
 
 	/*
 	 * If we've hit a contiguous 'large page' entry at this level, it
 	 * needs splitting first, unless we're unmapping the whole lot.
+	 *
+	 * For splitting, we can't rewrite 16 PTEs atomically, and since we
+	 * can't necessarily assume TEX remap we don't have a software bit to
+	 * mark live entries being split. In practice (i.e. DMA API code), we
+	 * will never be splitting large pages anyway, so just wrap this edge
+	 * case in a lock for the sake of correctness and be done with it.
 	 */
-	if (num_entries <= 1 && arm_v7s_pte_is_cont(pte[0], lvl))
-		arm_v7s_split_cont(data, iova, idx, lvl, ptep);
+	if (num_entries <= 1 && arm_v7s_pte_is_cont(pte[0], lvl)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&data->split_lock, flags);
+		pte[0] = arm_v7s_split_cont(data, iova, idx, lvl, ptep);
+		spin_unlock_irqrestore(&data->split_lock, flags);
+	}
 
 	/* If the size matches this level, we're in the right place */
 	if (num_entries) {
@@ -672,6 +709,7 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,
 	if (!data)
 		return NULL;
 
+	spin_lock_init(&data->split_lock);
 	data->l2_tables = kmem_cache_create("io-pgtable_armv7s_l2",
 					    ARM_V7S_TABLE_SIZE(2),
 					    ARM_V7S_TABLE_SIZE(2),
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 50+ messages in thread
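
For the contiguous-entry split that stays behind the new split_lock,
the shape is the classic check-after-lock pattern: whoever loses the
race to acquire the lock re-reads the PTE and backs off if it has
already been rewritten. A condensed userspace sketch follows, with a
pthread mutex standing in for the kernel's spin_lock_irqsave() and
pte_is_cont()/cont_to_small()/sync_ptes() as assumed helpers:

	#include <pthread.h>
	#include <stdint.h>

	#define CONT_PAGES	16	/* entries in one 'large page' run */

	extern int pte_is_cont(uint32_t pte);		/* assumed helpers */
	extern uint32_t cont_to_small(uint32_t pte);
	extern void sync_ptes(uint32_t *ptep, int n);

	static pthread_mutex_t split_lock = PTHREAD_MUTEX_INITIALIZER;

	/*
	 * Rewrite a 16-entry contiguous run (ptep points at its start) as
	 * 16 individual small-page entries so that one of them can then be
	 * unmapped. Returns the PTE actually found, letting the caller
	 * retry against whatever a racing unmap installed in the meantime.
	 */
	static uint32_t split_cont(uint32_t *ptep, uint32_t entry_size)
	{
		uint32_t pte;
		int i;

		pthread_mutex_lock(&split_lock);

		pte = *ptep;
		if (!pte_is_cont(pte)) {
			/* Lost the race: someone already split this run */
			pthread_mutex_unlock(&split_lock);
			return pte;
		}

		pte = cont_to_small(pte);
		for (i = 0; i < CONT_PAGES; i++)  /* not atomic; hence the lock */
			ptep[i] = pte + i * entry_size;

		sync_ptes(ptep, CONT_PAGES);
		/* (the real code also issues and waits for a TLB flush here) */

		pthread_mutex_unlock(&split_lock);
		return pte;
	}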

* [PATCH 7/8] iommu/arm-smmu: Remove io-pgtable spinlock
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

With the io-pgtable code now robust against (valid) races, we no longer
need to serialise all operations with a lock. This might make broken
callers who issue concurrent operations on overlapping addresses go even
more wrong than before, but hey, they already had little hope of useful
or deterministic results.

We do however still have to keep a lock around to serialise the ATS1*
translation ops, as parallel iova_to_phys() calls could lead to
unpredictable hardware behaviour otherwise.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu.c | 45 ++++++++++++++-------------------------------
 1 file changed, 14 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 1f42f339a284..b8d069a2b31d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -425,10 +425,10 @@ enum arm_smmu_domain_stage {
 struct arm_smmu_domain {
 	struct arm_smmu_device		*smmu;
 	struct io_pgtable_ops		*pgtbl_ops;
-	spinlock_t			pgtbl_lock;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
 	struct mutex			init_mutex; /* Protects smmu pointer */
+	spinlock_t			cb_lock; /* Serialises ATS1* ops */
 	struct iommu_domain		domain;
 };
 
@@ -1105,7 +1105,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	}
 
 	mutex_init(&smmu_domain->init_mutex);
-	spin_lock_init(&smmu_domain->pgtbl_lock);
+	spin_lock_init(&smmu_domain->cb_lock);
 
 	return &smmu_domain->domain;
 }
@@ -1383,35 +1383,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
-	int ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return -ENODEV;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->map(ops, iova, paddr, size, prot);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->map(ops, iova, paddr, size, prot);
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 			     size_t size)
 {
-	size_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->unmap(ops, iova, size);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->unmap(ops, iova, size);
 }
 
 static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
@@ -1425,10 +1413,11 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 	void __iomem *cb_base;
 	u32 tmp;
 	u64 phys;
-	unsigned long va;
+	unsigned long va, flags;
 
 	cb_base = ARM_SMMU_CB(smmu, cfg->cbndx);
 
+	spin_lock_irqsave(&smmu_domain->cb_lock, flags);
 	/* ATS1 registers can only be written atomically */
 	va = iova & ~0xfffUL;
 	if (smmu->version == ARM_SMMU_V2)
@@ -1438,6 +1427,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 
 	if (readl_poll_timeout_atomic(cb_base + ARM_SMMU_CB_ATSR, tmp,
 				      !(tmp & ATSR_ACTIVE), 5, 50)) {
+		spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
 		dev_err(dev,
 			"iova to phys timed out on %pad. Falling back to software table walk.\n",
 			&iova);
@@ -1445,6 +1435,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 	}
 
 	phys = readq_relaxed(cb_base + ARM_SMMU_CB_PAR);
+	spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
 	if (phys & CB_PAR_F) {
 		dev_err(dev, "translation fault!\n");
 		dev_err(dev, "PAR = 0x%llx\n", phys);
@@ -1457,10 +1448,8 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 					dma_addr_t iova)
 {
-	phys_addr_t ret;
-	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
 	if (domain->type == IOMMU_DOMAIN_IDENTITY)
 		return iova;
@@ -1468,17 +1457,11 @@ static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
 	if (smmu_domain->smmu->features & ARM_SMMU_FEAT_TRANS_OPS &&
-			smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
-		ret = arm_smmu_iova_to_phys_hard(domain, iova);
-	} else {
-		ret = ops->iova_to_phys(ops, iova);
-	}
+			smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
+		return arm_smmu_iova_to_phys_hard(domain, iova);
 
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-
-	return ret;
+	return ops->iova_to_phys(ops, iova);
 }
 
 static bool arm_smmu_capable(enum iommu_cap cap)
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 50+ messages in thread
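
The reason the ATS1* path keeps a lock while everything else goes
lock-free is visible in the hunks above: a hardware translation is a
multi-register sequence on the context bank (write the VA, poll for
completion, read back PAR), and two interleaved sequences would return
each other's results. Condensed to its essentials, with error handling
and the SMMUv1/v2 register-width split trimmed:

	spin_lock_irqsave(&smmu_domain->cb_lock, flags);

	/* 1. Kick off the translation (ATS1 registers written atomically) */
	writeq_relaxed(iova & ~0xfffUL, cb_base + ARM_SMMU_CB_ATS1PR);

	/* 2. Wait for the walk to finish... */
	while (readl_relaxed(cb_base + ARM_SMMU_CB_ATSR) & ATSR_ACTIVE)
		cpu_relax();	/* the real code bounds this with a timeout */

	/* 3. ...so that PAR is guaranteed to hold *our* result */
	phys = readq_relaxed(cb_base + ARM_SMMU_CB_PAR);

	spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);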

* [PATCH 8/8] iommu/arm-smmu-v3: Remove io-pgtable spinlock
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-08 11:52   ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-08 11:52 UTC (permalink / raw)
  To: will.deacon, joro
  Cc: salil.mehta, sunil.goutham, iommu, ray.jui, linux-arm-kernel,
	linu.cherian, nwatters

As for SMMUv2, take advantage of io-pgtable's newfound tolerance for
concurrency. Unfortunately in this case the command queue lock remains a
point of serialisation for the unmap path, but there may be a little
more we can do to ameliorate that in future.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 33 ++++++---------------------------
 1 file changed, 6 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b9c4cf4ccca2..291da5f918d5 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -645,7 +645,6 @@ struct arm_smmu_domain {
 	struct mutex			init_mutex; /* Protects smmu pointer */
 
 	struct io_pgtable_ops		*pgtbl_ops;
-	spinlock_t			pgtbl_lock;
 
 	enum arm_smmu_domain_stage	stage;
 	union {
@@ -1406,7 +1405,6 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	}
 
 	mutex_init(&smmu_domain->init_mutex);
-	spin_lock_init(&smmu_domain->pgtbl_lock);
 	return &smmu_domain->domain;
 }
 
@@ -1678,44 +1676,29 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
-	int ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return -ENODEV;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->map(ops, iova, paddr, size, prot);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->map(ops, iova, paddr, size, prot);
 }
 
 static size_t
 arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 {
-	size_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->unmap(ops, iova, size);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->unmap(ops, iova, size);
 }
 
 static phys_addr_t
 arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
-	phys_addr_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (domain->type == IOMMU_DOMAIN_IDENTITY)
 		return iova;
@@ -1723,11 +1706,7 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->iova_to_phys(ops, iova);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-
-	return ret;
+	return ops->iova_to_phys(ops, iova);
 }
 
 static struct platform_driver arm_smmu_driver;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 50+ messages in thread
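
To illustrate the serialisation that remains: with the page tables now
lock-free, the unmap path still funnels every TLB invalidation through
the SMMU's single command queue, whose insertion lock all CPUs share.
Schematically (field names abridged, size/granule handling omitted;
a sketch of the flow, not verbatim driver code):

	static void tlb_inv_range(unsigned long iova, size_t size, void *cookie)
	{
		struct arm_smmu_domain *smmu_domain = cookie;
		struct arm_smmu_cmdq_ent cmd = {
			.opcode		= CMDQ_OP_TLBI_NH_VA,	/* stage-1 case */
			.tlbi.addr	= iova,
		};

		/*
		 * Inserting the command takes the command queue's spinlock,
		 * so parallel unmaps on one SMMU still serialise here even
		 * though the page-table walk itself is now lock-free.
		 */
		arm_smmu_cmdq_issue_cmd(smmu_domain->smmu, &cmd);
	}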

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-08 11:51 ` Robin Murphy
@ 2017-06-09 19:28     ` Nate Watterson
  -1 siblings, 0 replies; 50+ messages in thread
From: Nate Watterson @ 2017-06-09 19:28 UTC (permalink / raw)
  To: Robin Murphy, will.deacon, joro
  Cc: sunil.goutham, iommu, ray.jui, linu.cherian, linux-arm-kernel

Hi Robin,

On 6/8/2017 7:51 AM, Robin Murphy wrote:
> Hi all,
> 
> Here's the cleaned up nominally-final version of the patches everybody's
> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
> #2-#4 do some preparatory work (and bid farewell to everyone's least
> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
> 
> The branch I've previously shared has been updated too:
> 
>    git://linux-arm.org/linux-rm  iommu/pgtable
> 
> All feedback welcome, as I'd really like to land this for 4.13.
>

I tested the series on a QDF2400 development platform and see notable
performance improvements, particularly in workloads that make concurrent
accesses to a single iommu_domain.

> Robin.
> 
> 
> Robin Murphy (8):
>    iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>    iommu/io-pgtable-arm: Improve split_blk_unmap
>    iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>    iommu/io-pgtable: Introduce explicit coherency
>    iommu/io-pgtable-arm: Support lockless operation
>    iommu/io-pgtable-arm-v7s: Support lockless operation
>    iommu/arm-smmu: Remove io-pgtable spinlock
>    iommu/arm-smmu-v3: Remove io-pgtable spinlock
> 
>   drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>   drivers/iommu/arm-smmu.c           |  48 ++++------
>   drivers/iommu/io-pgtable-arm-v7s.c | 173 +++++++++++++++++++++------------
>   drivers/iommu/io-pgtable-arm.c     | 190 ++++++++++++++++++++++++-------------
>   drivers/iommu/io-pgtable.h         |   6 ++
>   5 files changed, 268 insertions(+), 185 deletions(-)
> 

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-09 19:28     ` Nate Watterson
@ 2017-06-15  0:40         ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-06-15  0:40 UTC (permalink / raw)
  To: Nate Watterson, Robin Murphy, will.deacon, joro
  Cc: iommu, sunil.goutham, linux-arm-kernel, linu.cherian

Hi Robin,

I have applied this patch series on top of v4.12-rc4 and run various
Ethernet and NVMf target throughput tests on it.

To give you some background of my setup:

The system is an ARMv8-based system with 8 cores. It has various PCIe
root complexes that can be used to connect to PCIe endpoint devices,
including NIC cards and NVMe SSDs.

I'm particularly interested in the performance of the PCIe root complex
that connects to the NIC card; during my tests, the IOMMU is
enabled/disabled only for that particular PCIe root complex. The root
complexes connected to the NVMe SSDs remain unchanged (i.e., without IOMMU).

For the Ethernet throughput out of 50G link:

Note that during the multiple-TCP-session tests, the sessions are spread
across different CPU cores for optimized performance.

Without IOMMU:

TX TCP x1 - 29.7 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 28 Gbps

RX TCP x1 - 15 Gbps
RX TCP x4 - 33.7 Gbps
RX TCP x8 - 36 Gbps

With IOMMU, but without your latest patch:

TX TCP x1 - 15.2 Gbps
TX TCP x4 - 14.3 Gbps
TX TCP x8 - 13 Gbps

RX TCP x1 - 7.88 Gbps
RX TCP x4 - 13.2 Gbps
RX TCP x8 - 12.6 Gbps

With IOMMU and your latest patch:

TX TCP x1 - 21.4 Gbps
TX TCP x4 - 30.5 Gbps
TX TCP x8 - 21.3 Gbps

RX TCP x1 - 7.7 Gbps
RX TCP x4 - 20.1 Gbps
RX TCP x8 - 27.1 Gbps

With the NVMf target test with 4 SSDs, fio based test, random read, 4k,
8 jobs:

Without IOMMU:

IOPS = 1080K

With IOMMU, but without your latest patch:

IOPS = 520K

With IOMMU and your latest patch:

IOPS = 500K ~ 850K (a lot of variation observed during the same test run)

As you can see, performance has improved significantly with this patch
series! That is very impressive!

However, it is still noticeably lower than in the test runs without the
IOMMU. I'm wondering if more improvement is expected.

In addition, a much larger throughput variation is observed in the tests
with these latest patches, when multiple CPUs are involved. I'm
wondering if that is caused by some remaining lock in the driver?

Also, in a few occasions, I observed the following message during the
test, when multiple cores are involved:

arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked

Thanks,

Ray

On 6/9/17 12:28 PM, Nate Watterson wrote:
> Hi Robin,
> 
> On 6/8/2017 7:51 AM, Robin Murphy wrote:
>> Hi all,
>>
>> Here's the cleaned up nominally-final version of the patches everybody's
>> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
>> #2-#4 do some preparatory work (and bid farewell to everyone's least
>> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
>>
>> The branch I've previously shared has been updated too:
>>
>>    git://linux-arm.org/linux-rm  iommu/pgtable
>>
>> All feedback welcome, as I'd really like to land this for 4.13.
>>
> 
> I tested the series on a QDF2400 development platform and see notable
> performance improvements particularly in workloads that make concurrent
> accesses to a single iommu_domain.
> 
>> Robin.
>>
>>
>> Robin Murphy (8):
>>    iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>    iommu/io-pgtable-arm: Improve split_blk_unmap
>>    iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>    iommu/io-pgtable: Introduce explicit coherency
>>    iommu/io-pgtable-arm: Support lockless operation
>>    iommu/io-pgtable-arm-v7s: Support lockless operation
>>    iommu/arm-smmu: Remove io-pgtable spinlock
>>    iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>
>>   drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>   drivers/iommu/arm-smmu.c           |  48 ++++------
>>   drivers/iommu/io-pgtable-arm-v7s.c | 173
>> +++++++++++++++++++++------------
>>   drivers/iommu/io-pgtable-arm.c     | 190
>> ++++++++++++++++++++++++-------------
>>   drivers/iommu/io-pgtable.h         |   6 ++
>>   5 files changed, 268 insertions(+), 185 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-15  0:40         ` Ray Jui
@ 2017-06-15 12:25             ` John Garry
  -1 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2017-06-15 12:25 UTC (permalink / raw)
  To: Robin Murphy, will.deacon
  Cc: sunil.goutham, thunder.leizhen, Hanjun Guo, Linuxarm, iommu, Ray Jui,
	Zhou Wang, linu.cherian, linux-arm-kernel

On 15/06/2017 01:40, Ray Jui via iommu wrote:

Hi Robin,

wangzhou tested this patchset on our SMMUv3-based development board with
a 10G PCI NIC.

Currently we see a ~17% performance (throughput) drop when enabling the 
SMMU, but only a ~8% drop with your patchset.

FYI, for our integrated storage and network adapter, we see a big 
performance hit (maybe 40%) when enabling the SMMU with or without the 
patchset. Leizhen has been doing some investigation on this.

Thanks,
John

> Hi Robin,
>
> I have applied this patch series on top of v4.12-rc4, and ran various
> Ethernet and NVMf target throughput tests on it.
>
> To give you some background of my setup:
>
> The system is a ARMv8 based system with 8 cores. It has various PCIe
> root complexes that can be used to connect to PCIe endpoint devices
> including NIC cards and NVMe SSDs.
>
> I'm particularly interested in the performance of the PCIe root complex
> that connects to the NIC card, and during my test, IOMMU is
> enabled/disabled against that particular PCIe root complex. The root
> complexes connected to NVMe SSDs remain unchanged (i.e., without IOMMU).
>
> For the Ethernet throughput out of 50G link:
>
> Note during the multiple TCP session test, each session will be spread
> to different CPU cores for optimized performance
>
> Without IOMMU:
>
> TX TCP x1 - 29.7 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 28 Gbps
>
> RX TCP x1 - 15 Gbps
> RX TCP x4 - 33.7 Gbps
> RX TCP x8 - 36 Gbps
>
> With IOMMU, but without your latest patch:
>
> TX TCP x1 - 15.2 Gbps
> TX TCP x4 - 14.3 Gbps
> TX TCP x8 - 13 Gbps
>
> RX TCP x1 - 7.88 Gbps
> RX TCP x4 - 13.2 Gbps
> RX TCP x8 - 12.6 Gbps
>
> With IOMMU and your latest patch:
>
> TX TCP x1 - 21.4 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 21.3 Gbps
>
> RX TCP x1 - 7.7 Gbps
> RX TCP x4 - 20.1 Gbps
> RX TCP x8 - 27.1 Gbps
>
> With the NVMf target test with 4 SSDs, fio based test, random read, 4k,
> 8 jobs:
>
> Without IOMMU:
>
> IOPS = 1080K
>
> With IOMMU, but without your latest patch:
>
> IOPS = 520K
>
> With IOMMU and your latest patch:
>
> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)
>
> As you can see, performance has improved significantly with this patch
> series! That is very impressive!
>
> However, it is still off, compared to the test runs without the IOMMU.
> I'm wondering if more improvement is expected.
>
> In addition, a much larger throughput variation is observed in the tests
> with these latest patches, when multiple CPUs are involved. I'm
> wondering if that is caused by some remaining lock in the driver?
>
> Also, in a few occasions, I observed the following message during the
> test, when multiple cores are involved:
>
> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
>
> Thanks,
>
> Ray
>
> On 6/9/17 12:28 PM, Nate Watterson wrote:
>> Hi Robin,
>>
>> On 6/8/2017 7:51 AM, Robin Murphy wrote:
>>> Hi all,
>>>
>>> Here's the cleaned up nominally-final version of the patches everybody's
>>> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
>>> #2-#4 do some preparatory work (and bid farewell to everyone's least
>>> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
>>>
>>> The branch I've previously shared has been updated too:
>>>
>>>    git://linux-arm.org/linux-rm  iommu/pgtable
>>>
>>> All feedback welcome, as I'd really like to land this for 4.13.
>>>
>>
>> I tested the series on a QDF2400 development platform and see notable
>> performance improvements particularly in workloads that make concurrent
>> accesses to a single iommu_domain.
>>
>>> Robin.
>>>
>>>
>>> Robin Murphy (8):
>>>    iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>>    iommu/io-pgtable-arm: Improve split_blk_unmap
>>>    iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>>    iommu/io-pgtable: Introduce explicit coherency
>>>    iommu/io-pgtable-arm: Support lockless operation
>>>    iommu/io-pgtable-arm-v7s: Support lockless operation
>>>    iommu/arm-smmu: Remove io-pgtable spinlock
>>>    iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>>
>>>   drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>>   drivers/iommu/arm-smmu.c           |  48 ++++------
>>>   drivers/iommu/io-pgtable-arm-v7s.c | 173
>>> +++++++++++++++++++++------------
>>>   drivers/iommu/io-pgtable-arm.c     | 190
>>> ++++++++++++++++++++++++-------------
>>>   drivers/iommu/io-pgtable.h         |   6 ++
>>>   5 files changed, 268 insertions(+), 185 deletions(-)
>>>
>>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> .
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-15  0:40         ` Ray Jui
@ 2017-06-20 13:37             ` Robin Murphy
  -1 siblings, 0 replies; 50+ messages in thread
From: Robin Murphy @ 2017-06-20 13:37 UTC (permalink / raw)
  To: Ray Jui, Nate Watterson, will.deacon-5wv7dgnIgG8,
	joro-zLv9SwRftAIdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

On 15/06/17 01:40, Ray Jui wrote:
> Hi Robin,
> 
> I have applied this patch series on top of v4.12-rc4, and ran various
> Ethernet and NVMf target throughput tests on it.
> 
> To give you some background of my setup:
> 
> The system is an ARMv8-based system with 8 cores. It has various PCIe
> root complexes that can be used to connect to PCIe endpoint devices
> including NIC cards and NVMe SSDs.
> 
> I'm particularly interested in the performance of the PCIe root complex
> that connects to the NIC card, and during my test, IOMMU is
> enabled/disabled against that particular PCIe root complex. The root
> complexes connected to NVMe SSDs remain unchanged (i.e., without IOMMU).
> 
> For the Ethernet throughput out of 50G link:
> 
> Note during the multiple TCP session test, each session will be spread
> to different CPU cores for optimized performance
> 
> Without IOMMU:
> 
> TX TCP x1 - 29.7 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 28 Gbps
> 
> RX TCP x1 - 15 Gbps
> RX TCP x4 - 33.7 Gbps
> RX TCP x8 - 36 Gbps
> 
> With IOMMU, but without your latest patch:
> 
> TX TCP x1 - 15.2 Gbps
> TX TCP x4 - 14.3 Gbps
> TX TCP x8 - 13 Gbps
> 
> RX TCP x1 - 7.88 Gbps
> RX TCP x4 - 13.2 Gbps
> RX TCP x8 - 12.6 Gbps
> 
> With IOMMU and your latest patch:
> 
> TX TCP x1 - 21.4 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 21.3 Gbps
> 
> RX TCP x1 - 7.7 Gbps
> RX TCP x4 - 20.1 Gbps
> RX TCP x8 - 27.1 Gbps

Cool, those seem more or less in line with expectations. Nate's
currently cooking a patch to further reduce the overhead when unmapping
multi-page buffers, which we believe should make up most of the rest of
the difference.

> With the NVMf target test with 4 SSDs, fio based test, random read, 4k,
> 8 jobs:
> 
> Without IOMMU:
> 
> IOPS = 1080K
> 
> With IOMMU, but without your latest patch:
> 
> IOPS = 520K
> 
> With IOMMU and your latest patch:
> 
> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)

That does seem a bit off - are you able to try some perf profiling to
get a better idea of where the overhead appears to be?

> As you can see, performance has improved significantly with this patch
> series! That is very impressive!
> 
> However, it is still off, compared to the test runs without the IOMMU.
> I'm wondering if more improvement is expected.
> 
> In addition, a much larger throughput variation is observed in the tests
> with these latest patches, when multiple CPUs are involved. I'm
> wondering if that is caused by some remaining lock in the driver?

Assuming this is the platform with MMU-500, there shouldn't be any locks
left, since that shouldn't have the hardware ATOS registers for
iova_to_phys().

> Also, on a few occasions, I observed the following message during the
> test, when multiple cores are involved:
> 
> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked

That's particularly worrying, because it means we spent over a second
waiting for something that normally shouldn't take more than a few
hundred cycles. The only time I've ever actually seen that happen is if
TLBSYNC is issued while a context fault is pending - on MMU-500 it seems
that the sync just doesn't proceed until the fault is cleared - but that
stemmed from interrupts not being wired up correctly (on FPGAs) such
that we never saw the fault reported in the first place :/

Robin.

> 
> Thanks,
> 
> Ray
> 
> On 6/9/17 12:28 PM, Nate Watterson wrote:
>> Hi Robin,
>>
>> On 6/8/2017 7:51 AM, Robin Murphy wrote:
>>> Hi all,
>>>
>>> Here's the cleaned up nominally-final version of the patches everybody's
>>> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
>>> #2-#4 do some preparatory work (and bid farewell to everyone's least
>>> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
>>>
>>> The branch I've previously shared has been updated too:
>>>
>>>    git://linux-arm.org/linux-rm  iommu/pgtable
>>>
>>> All feedback welcome, as I'd really like to land this for 4.13.
>>>
>>
>> I tested the series on a QDF2400 development platform and see notable
>> performance improvements particularly in workloads that make concurrent
>> accesses to a single iommu_domain.
>>
>>> Robin.
>>>
>>>
>>> Robin Murphy (8):
>>>    iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>>    iommu/io-pgtable-arm: Improve split_blk_unmap
>>>    iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>>    iommu/io-pgtable: Introduce explicit coherency
>>>    iommu/io-pgtable-arm: Support lockless operation
>>>    iommu/io-pgtable-arm-v7s: Support lockless operation
>>>    iommu/arm-smmu: Remove io-pgtable spinlock
>>>    iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>>
>>>   drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>>   drivers/iommu/arm-smmu.c           |  48 ++++------
>>>   drivers/iommu/io-pgtable-arm-v7s.c | 173
>>> +++++++++++++++++++++------------
>>>   drivers/iommu/io-pgtable-arm.c     | 190
>>> ++++++++++++++++++++++++-------------
>>>   drivers/iommu/io-pgtable.h         |   6 ++
>>>   5 files changed, 268 insertions(+), 185 deletions(-)
>>>
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-15  0:40         ` Ray Jui
@ 2017-06-21 15:47             ` Joerg Roedel
  -1 siblings, 0 replies; 50+ messages in thread
From: Joerg Roedel @ 2017-06-21 15:47 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, Jun 14, 2017 at 05:40:30PM -0700, Ray Jui wrote:
> With the NVMf target test with 4 SSDs, fio based test, random read, 4k,
> 8 jobs:
> 
> Without IOMMU:
> 
> IOPS = 1080K
> 
> With IOMMU, but without your latest patch:
> 
> IOPS = 520K
> 
> With IOMMU and your latest patch:
> 
> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)

That variation might come from the RB-Tree used in the IOVA allocator.
For block-device workloads the allocation size of iova ranges might be
bigger than what is cached in the magazines, so that the fall-back to
the old (locked) allocator is used.
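
For illustration, the pattern at play is roughly as follows -- a
simplified sketch, where the function and field names are assumptions
for clarity rather than the actual drivers/iommu/iova.c code:

	/*
	 * Sketch: per-CPU magazines satisfy small, common IOVA sizes
	 * without taking the global lock; larger requests fall back to
	 * the rbtree allocator under a spinlock, where concurrent CPUs
	 * serialize -- a plausible source of the IOPS variation.
	 * (iova_sketch_dom, magazine_pop and rbtree_alloc_range are
	 * hypothetical names, not real kernel APIs.)
	 */
	static unsigned long iova_alloc_sketch(struct iova_sketch_dom *d,
					       unsigned long size)
	{
		unsigned long pfn;

		if (size <= d->max_cached_size) {
			/* lock-free fast path */
			pfn = magazine_pop(this_cpu_ptr(d->rcache), size);
			if (pfn)
				return pfn;
		}

		/* contended slow path */
		spin_lock(&d->lock);
		pfn = rbtree_alloc_range(d, size);
		spin_unlock(&d->lock);
		return pfn;
	}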


	Joerg

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-20 13:37             ` Robin Murphy
@ 2017-06-27 16:43                 ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-06-27 16:43 UTC (permalink / raw)
  To: Robin Murphy, Nate Watterson, will.deacon-5wv7dgnIgG8,
	joro-zLv9SwRftAIdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

Hi Robin,

On 6/20/17 6:37 AM, Robin Murphy wrote:
> On 15/06/17 01:40, Ray Jui wrote:
>> Hi Robin,
>>
>> I have applied this patch series on top of v4.12-rc4, and ran various
>> Ethernet and NVMf target throughput tests on it.
>>
>> To give you some background of my setup:
>>
>> The system is an ARMv8-based system with 8 cores. It has various PCIe
>> root complexes that can be used to connect to PCIe endpoint devices
>> including NIC cards and NVMe SSDs.
>>
>> I'm particularly interested in the performance of the PCIe root complex
>> that connects to the NIC card, and during my test, IOMMU is
>> enabled/disabled against that particular PCIe root complex. The root
>> complexes connected to NVMe SSDs remain unchanged (i.e., without IOMMU).
>>
>> For the Ethernet throughput out of 50G link:
>>
>> Note during the multiple TCP session test, each session will be spread
>> to different CPU cores for optimized performance
>>
>> Without IOMMU:
>>
>> TX TCP x1 - 29.7 Gbps
>> TX TCP x4 - 30.5 Gbps
>> TX TCP x8 - 28 Gbps
>>
>> RX TCP x1 - 15 Gbps
>> RX TCP x4 - 33.7 Gbps
>> RX TCP x8 - 36 Gbps
>>
>> With IOMMU, but without your latest patch:
>>
>> TX TCP x1 - 15.2 Gbps
>> TX TCP x4 - 14.3 Gbps
>> TX TCP x8 - 13 Gbps
>>
>> RX TCP x1 - 7.88 Gbps
>> RX TCP x4 - 13.2 Gbps
>> RX TCP x8 - 12.6 Gbps
>>
>> With IOMMU and your latest patch:
>>
>> TX TCP x1 - 21.4 Gbps
>> TX TCP x4 - 30.5 Gbps
>> TX TCP x8 - 21.3 Gbps
>>
>> RX TCP x1 - 7.7 Gbps
>> RX TCP x4 - 20.1 Gbps
>> RX TCP x8 - 27.1 Gbps
> 
> Cool, those seem more or less in line with expectations. Nate's
> currently cooking a patch to further reduce the overhead when unmapping
> multi-page buffers, which we believe should make up most of the rest of
> the difference.
> 

That's great to hear!

>> With the NVMf target test with 4 SSDs, fio based test, random read, 4k,
>> 8 jobs:
>>
>> Without IOMMU:
>>
>> IOPS = 1080K
>>
>> With IOMMU, but without your latest patch:
>>
>> IOPS = 520K
>>
>> With IOMMU and your latest patch:
>>
>> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)
> 
> That does seem a bit off - are you able to try some perf profiling to
> get a better idea of where the overhead appears to be?
> 

I haven't had any time to look into this closer. But when I have a
chance, I will take a look (but that will not be anytime soon).

>> As you can see, performance has improved significantly with this patch
>> series! That is very impressive!
>>
>> However, it is still off, compared to the test runs without the IOMMU.
>> I'm wondering if more improvement is expected.
>>
>> In addition, a much larger throughput variation is observed in the tests
>> with these latest patches, when multiple CPUs are involved. I'm
>> wondering if that is caused by some remaining lock in the driver?
> 
> Assuming this is the platform with MMU-500, there shouldn't be any locks
> left, since that shouldn't have the hardware ATOS registers for
> iova_to_phys().
> 

Yes, this is with MMU-500.

>> Also, on a few occasions, I observed the following message during the
>> test, when multiple cores are involved:
>>
>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
> 
> That's particularly worrying, because it means we spent over a second
> waiting for something that normally shouldn't take more than a few
> hundred cycles. The only time I've ever actually seen that happen is if
> TLBSYNC is issued while a context fault is pending - on MMU-500 it seems
> that the sync just doesn't proceed until the fault is cleared - but that
> stemmed from interrupts not being wired up correctly (on FPGAs) such
> that we never saw the fault reported in the first place :/
> 
> Robin.
> 

Okay, note the above error is reproduced only when we have a lot of TCP
sessions spread across all 8 CPU cores. It's fairly easy to reproduce in
our system. But I haven't had any time to take a closer look.

I also saw that patchset v2 is out. Based on your reply to other people,
I assume I do not need to test v2 explicitly. If you think there's a
need for me to help test v2, don't hesitate to let me know.

Thanks,

Ray

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-27 16:43                 ` Ray Jui
@ 2017-06-28 11:46                     ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2017-06-28 11:46 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Ray,

Robin and I have been bashing our heads against the tlb_sync_pending flag
this morning, and we reckon it could have something to do with your timeouts
on MMU-500.

On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
> >> Also, on a few occasions, I observed the following message during the
> >> test, when multiple cores are involved:
> >>
> >> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked

The tlb_sync_pending logic was written under the assumption of a global
page-table lock, so it assumes that it only has to care about syncing
flushes from the current CPU/context. That's not true anymore, and the
current code can accidentally skip syncs and (what I think is happening in
your case) allow concurrent syncs, which will potentially lead to timeouts
if a CPU is unlucky enough to keep missing the Ack.
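
To illustrate, one bad interleaving with the plain bool flag looks like
this (an illustrative trace, not taken from a real capture):

  CPU0: tlb_add_flush()  -> tlb_sync_pending = true
  CPU1: tlb_add_flush()  -> tlb_sync_pending = true
  CPU0: tlb_sync()       -> flag is true: issue TLBSYNC, poll for Ack
  CPU1: tlb_sync()       -> flag is still true: issue a second TLBSYNC
                            concurrently, before either clears the flag

and if the plain stores race the other way, CPU1's write of 'false' can
clobber CPU0's write of 'true', so a still-needed sync is skipped.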

Please can you try the diff below and see if it fixes things for you?
This applies on top of my for-joerg/arm-smmu/updates branch, but note
that I've only shown it to the compiler. Not tested at all.

Will

--->8

diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 127558d83667..cd8d7aaec161 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
 	iop->cookie	= cookie;
 	iop->cfg	= *cfg;
 
+	atomic_set(&iop->tlb_sync_pending, 0);
 	return &iop->ops;
 }
 
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 524263a7ae6f..b64580c9d03d 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -1,5 +1,7 @@
 #ifndef __IO_PGTABLE_H
 #define __IO_PGTABLE_H
+
+#include <linux/atomic.h>
 #include <linux/bitops.h>
 
 /*
@@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
 struct io_pgtable {
 	enum io_pgtable_fmt	fmt;
 	void			*cookie;
-	bool			tlb_sync_pending;
+	atomic_t		tlb_sync_pending;
 	struct io_pgtable_cfg	cfg;
 	struct io_pgtable_ops	ops;
 };
@@ -175,22 +177,20 @@ struct io_pgtable {
 static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
 {
 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
 		unsigned long iova, size_t size, size_t granule, bool leaf)
 {
 	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
 {
-	if (iop->tlb_sync_pending) {
+	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
 		iop->cfg.tlb->tlb_sync(iop->cookie);
-		iop->tlb_sync_pending = false;
-	}
 }
 
 /**

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-28 11:46                     ` Will Deacon
@ 2017-06-28 17:02                         ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-06-28 17:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will/Robin,

On 6/28/17 4:46 AM, Will Deacon wrote:
> Hi Ray,
> 
> Robin and I have been bashing our heads against the tlb_sync_pending flag
> this morning, and we reckon it could have something to do with your timeouts
> on MMU-500.
> 
> On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
>>>> Also, on a few occasions, I observed the following message during the
>>>> test, when multiple cores are involved:
>>>>
>>>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
> 
> The tlb_sync_pending logic was written under the assumption of a global
> page-table lock, so it assumes that it only has to care about syncing
> flushes from the current CPU/context. That's not true anymore, and the
> current code can accidentally skip syncs and (what I think is happening in
> your case) allow concurrent syncs, which will potentially lead to timeouts
> if a CPU is unlucky enough to keep missing the Ack.
> 
> Please can you try the diff below and see if it fixes things for you?
> This applies on top of my for-joerg/arm-smmu/updates branch, but note
> that I've only shown it to the compiler. Not tested at all.
> 
> Will
> 

Thanks for looking into this. I'm a bit busy at work but will certainly
find time to test the diff below. I hopefully will get to it later this
week.

Thanks,

Ray

> --->8
> 
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> index 127558d83667..cd8d7aaec161 100644
> --- a/drivers/iommu/io-pgtable.c
> +++ b/drivers/iommu/io-pgtable.c
> @@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
>  	iop->cookie	= cookie;
>  	iop->cfg	= *cfg;
>  
> +	atomic_set(&iop->tlb_sync_pending, 0);
>  	return &iop->ops;
>  }
>  
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 524263a7ae6f..b64580c9d03d 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -1,5 +1,7 @@
>  #ifndef __IO_PGTABLE_H
>  #define __IO_PGTABLE_H
> +
> +#include <linux/atomic.h>
>  #include <linux/bitops.h>
>  
>  /*
> @@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
>  struct io_pgtable {
>  	enum io_pgtable_fmt	fmt;
>  	void			*cookie;
> -	bool			tlb_sync_pending;
> +	atomic_t		tlb_sync_pending;
>  	struct io_pgtable_cfg	cfg;
>  	struct io_pgtable_ops	ops;
>  };
> @@ -175,22 +177,20 @@ struct io_pgtable {
>  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
>  {
>  	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>  
>  static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
>  		unsigned long iova, size_t size, size_t granule, bool leaf)
>  {
>  	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>  
>  static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
>  {
> -	if (iop->tlb_sync_pending) {
> +	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
>  		iop->cfg.tlb->tlb_sync(iop->cookie);
> -		iop->tlb_sync_pending = false;
> -	}
>  }
>  
>  /**
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-06-28 17:02                         ` Ray Jui
@ 2017-07-04 17:31                             ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2017-07-04 17:31 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Ray,

On Wed, Jun 28, 2017 at 10:02:35AM -0700, Ray Jui wrote:
> On 6/28/17 4:46 AM, Will Deacon wrote:
> > Robin and I have been bashing our heads against the tlb_sync_pending flag
> > this morning, and we reckon it could have something to do with your timeouts
> > on MMU-500.
> > 
> > On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
> >>>>>> Also, on a few occasions, I observed the following message during the
> >>>> test, when multiple cores are involved:
> >>>>
> >>>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
> > 
> > The tlb_sync_pending logic was written under the assumption of a global
> > page-table lock, so it assumes that it only has to care about syncing
> > flushes from the current CPU/context. That's not true anymore, and the
> > current code can accidentally skip syncs and (what I think is happening in
> > your case) allow concurrent syncs, which will potentially lead to timeouts
> > if a CPU is unlucky enough to keep missing the Ack.
> > 
> > Please can you try the diff below and see if it fixes things for you?
> > This applies on top of my for-joerg/arm-smmu/updates branch, but note
> > that I've only shown it to the compiler. Not tested at all.
> > 
> > Will
> > 
> 
> Thanks for looking into this. I'm a bit busy at work but will certainly
> find time to test the diff below. I hopefully will get to it later this
> week.

It would be really handy if you could test this, since I think it could
cause some nasty problems if we don't get it fixed. Updated patch (with
commit message) below.

Will

--->8

From eeb11dab63fcdd698b671a3a8c63516005caa9ec Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
Date: Thu, 29 Jun 2017 15:08:09 +0100
Subject: [PATCH] iommu/io-pgtable: Fix tlb_sync_pending flag access from
 concurrent CPUs

The tlb_sync_pending flag is used to elide back-to-back TLB sync operations
for two reasons:

  1. TLB sync operations can be expensive, and so avoiding them where we
     can is a good idea.

  2. Some hardware (mtk_iommu) locks up if it sees a TLB sync without an
     unsync'd TLB flush preceding it

The flag is set on an ->add_flush callback and cleared on a ->sync callback,
which worked nicely when all map/unmap operations were protected by a
global lock.

Unfortunately, moving to a lockless implementation means that we suddenly
have races on the flag: updates can go missing and we can end up with
back-to-back syncs once again.

This patch resolves the problem by making the tlb_sync_pending flag an
atomic_t and sorting out the ordering with respect to TLB callbacks.
Now, the flag is set with release semantics after adding a flush and
checked with an xchg operation (and subsequent control dependency) when
performing the sync. We could consider using a cmpxchg here, but we'll
likely just hit our local update to the flag anyway.
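
To illustrate the intended behaviour (a sketch of the interleaving, not
a real trace), two racing syncs now resolve as follows:

  CPU0: atomic_xchg_relaxed(&pending, 0) returns 1 -> performs ->tlb_sync()
  CPU1: atomic_xchg_relaxed(&pending, 0) returns 0 -> elides its sync,
        since the flush it posted (published by the release store) is
        covered by the sync already in flight

so duplicate syncs for the same pending flush are no longer issued.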

Cc: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Cc: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
Fixes: 2c3d273eabe8 ("iommu/io-pgtable-arm: Support lockless operation")
Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/io-pgtable.c |  1 +
 drivers/iommu/io-pgtable.h | 12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 127558d83667..cd8d7aaec161 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
 	iop->cookie	= cookie;
 	iop->cfg	= *cfg;
 
+	atomic_set(&iop->tlb_sync_pending, 0);
 	return &iop->ops;
 }
 
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 524263a7ae6f..b64580c9d03d 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -1,5 +1,7 @@
 #ifndef __IO_PGTABLE_H
 #define __IO_PGTABLE_H
+
+#include <linux/atomic.h>
 #include <linux/bitops.h>
 
 /*
@@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
 struct io_pgtable {
 	enum io_pgtable_fmt	fmt;
 	void			*cookie;
-	bool			tlb_sync_pending;
+	atomic_t		tlb_sync_pending;
 	struct io_pgtable_cfg	cfg;
 	struct io_pgtable_ops	ops;
 };
@@ -175,22 +177,20 @@ struct io_pgtable {
 static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
 {
 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
 		unsigned long iova, size_t size, size_t granule, bool leaf)
 {
 	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
 {
-	if (iop->tlb_sync_pending) {
+	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
 		iop->cfg.tlb->tlb_sync(iop->cookie);
-		iop->tlb_sync_pending = false;
-	}
 }
 
 /**
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-04 17:31                             ` Will Deacon
@ 2017-07-04 17:39                                 ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-07-04 17:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On 7/4/17 10:31 AM, Will Deacon wrote:
> Ray,
> 
> On Wed, Jun 28, 2017 at 10:02:35AM -0700, Ray Jui wrote:
>> On 6/28/17 4:46 AM, Will Deacon wrote:
>>> Robin and I have been bashing our heads against the tlb_sync_pending flag
>>> this morning, and we reckon it could have something to do with your timeouts
>>> on MMU-500.
>>>
>>> On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
>>>>>> Also, on a few occasions, I observed the following message during the
>>>>>> test, when multiple cores are involved:
>>>>>>
>>>>>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
>>>
>>> The tlb_sync_pending logic was written under the assumption of a global
>>> page-table lock, so it assumes that it only has to care about syncing
>>> flushes from the current CPU/context. That's not true anymore, and the
>>> current code can accidentally skip syncs and (what I think is happening in
>>> your case) allow concurrent syncs, which will potentially lead to timeouts
>>> if a CPU is unlucky enough to keep missing the Ack.
>>>
>>> Please can you try the diff below and see if it fixes things for you?
>>> This applies on top of my for-joerg/arm-smmu/updates branch, but note
>>> that I've only shown it to the compiler. Not tested at all.
>>>
>>> Will
>>>
>>
>> Thanks for looking into this. I'm a bit busy at work but will certainly
>> find time to test the diff below. I hopefully will get to it later this
>> week.
> 
> It would be really handy if you could test this, since I think it could
> cause some nasty problems if we don't get it fixed. Updated patch (with
> commit message) below.
> 
> Will

Yes I understand. Sorry I was way too busy last week and could not get
to it. Will definitely find time to test this ASAP.

Regards,

Ray

> 
> --->8
> 
> From eeb11dab63fcdd698b671a3a8c63516005caa9ec Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> Date: Thu, 29 Jun 2017 15:08:09 +0100
> Subject: [PATCH] iommu/io-pgtable: Fix tlb_sync_pending flag access from
>  concurrent CPUs
> 
> The tlb_sync_pending flag is used to elide back-to-back TLB sync operations
> for two reasons:
> 
>   1. TLB sync operations can be expensive, and so avoiding them where we
>      can is a good idea.
> 
>   2. Some hardware (mtk_iommu) locks up if it sees a TLB sync without an
>      unsync'd TLB flush preceding it
> 
> The flag is set on an ->add_flush callback and cleared on a ->sync callback,
> which worked nicely when all map/unmap operations were protected by a
> global lock.
> 
> Unfortunately, moving to a lockless implementation means that we suddenly
> have races on the flag: updates can go missing and we can end up with
> back-to-back syncs once again.
> 
> This patch resolves the problem by making the tlb_sync_pending flag an
> atomic_t and sorting out the ordering with respect to TLB callbacks.
> Now, the flag is set with release semantics after adding a flush and
> checked with an xchg operation (and subsequent control dependency) when
> performing the sync. We could consider using a cmpxchg here, but we'll
> likely just hit our local update to the flag anyway.
> 
> Cc: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Cc: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> Fixes: 2c3d273eabe8 ("iommu/io-pgtable-arm: Support lockless operation")
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---
>  drivers/iommu/io-pgtable.c |  1 +
>  drivers/iommu/io-pgtable.h | 12 ++++++------
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> index 127558d83667..cd8d7aaec161 100644
> --- a/drivers/iommu/io-pgtable.c
> +++ b/drivers/iommu/io-pgtable.c
> @@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
>  	iop->cookie	= cookie;
>  	iop->cfg	= *cfg;
>  
> +	atomic_set(&iop->tlb_sync_pending, 0);
>  	return &iop->ops;
>  }
>  
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 524263a7ae6f..b64580c9d03d 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -1,5 +1,7 @@
>  #ifndef __IO_PGTABLE_H
>  #define __IO_PGTABLE_H
> +
> +#include <linux/atomic.h>
>  #include <linux/bitops.h>
>  
>  /*
> @@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
>  struct io_pgtable {
>  	enum io_pgtable_fmt	fmt;
>  	void			*cookie;
> -	bool			tlb_sync_pending;
> +	atomic_t		tlb_sync_pending;
>  	struct io_pgtable_cfg	cfg;
>  	struct io_pgtable_ops	ops;
>  };
> @@ -175,22 +177,20 @@ struct io_pgtable {
>  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
>  {
>  	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>  
>  static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
>  		unsigned long iova, size_t size, size_t granule, bool leaf)
>  {
>  	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>  
>  static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
>  {
> -	if (iop->tlb_sync_pending) {
> +	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
>  		iop->cfg.tlb->tlb_sync(iop->cookie);
> -		iop->tlb_sync_pending = false;
> -	}
>  }
>  
>  /**
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-04 17:39                                 ` Ray Jui
@ 2017-07-05  1:45                                     ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-07-05  1:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: sunil.goutham, iommu, linu.cherian, linux-arm-kernel

Hi Will/Robin,

Has anything functionally changed between PATCH v2 and v1? I'm seeing a
very different L2 throughput with v2 (in general a lot worse with v2 vs.
v1); however, I'm currently unable to reproduce the TLB sync timed out
issue with v2 (without the patch from Will's email).

It could also be something else that has changed in my setup, but so far
I have not yet been able to spot anything wrong in the setup.

Thanks,

Ray

On 7/4/17 10:39 AM, Ray Jui wrote:
> Hi Will,
> 
> On 7/4/17 10:31 AM, Will Deacon wrote:
>> Ray,
>>
>> On Wed, Jun 28, 2017 at 10:02:35AM -0700, Ray Jui wrote:
>>> On 6/28/17 4:46 AM, Will Deacon wrote:
>>>> Robin and I have been bashing our heads against the tlb_sync_pending flag
>>>> this morning, and we reckon it could have something to do with your timeouts
>>>> on MMU-500.
>>>>
>>>> On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
>>>>>>> Also, in a few occasions, I observed the following message during the
>>>>>>> test, when multiple cores are involved:
>>>>>>>
>>>>>>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
>>>>
>>>> The tlb_sync_pending logic was written under the assumption of a global
>>>> page-table lock, so it assumes that it only has to care about syncing
>>>> flushes from the current CPU/context. That's not true anymore, and the
>>>> current code can accidentally skip syncs and (what I think is happening in
>>>> your case) allow concurrent syncs, which will potentially lead to timeouts
>>>> if a CPU is unlucky enough to keep missing the Ack.
>>>>
>>>> Please can you try the diff below and see if it fixes things for you?
>>>> This applies on top of my for-joerg/arm-smmu/updates branch, but note
>>>> that I've only shown it to the compiler. Not tested at all.
>>>>
>>>> Will
>>>>
>>>
>>> Thanks for looking into this. I'm a bit busy at work but will certainly
>>> find time to test the diff below. I hopefully will get to it later this
>>> week.
>>
>> It would be really handy if you could test this, since I think it could
>> cause some nasty problems if we don't get it fixed. Updated patch (with
>> commit message) below.
>>
>> Will
> 
> Yes I understand. Sorry I was way too busy last week and could not get
> to it. Will definitely find time to test this ASAP.
> 
> Regards,
> 
> Ray
> 
>>
>> --->8
>>
>> From eeb11dab63fcdd698b671a3a8c63516005caa9ec Mon Sep 17 00:00:00 2001
>> From: Will Deacon <will.deacon@arm.com>
>> Date: Thu, 29 Jun 2017 15:08:09 +0100
>> Subject: [PATCH] iommu/io-pgtable: Fix tlb_sync_pending flag access from
>>  concurrent CPUs
>>
>> The tlb_sync_pending flag is used to elide back-to-back TLB sync operations
>> for two reasons:
>>
>>   1. TLB sync operations can be expensive, and so avoiding them where we
>>      can is a good idea.
>>
>>   2. Some hardware (mtk_iommu) locks up if it sees a TLB sync without an
>>      unsync'd TLB flush preceding it
>>
>> The flag is set on an ->add_flush callback and cleared on a ->sync callback,
>> which worked nicely when all map/unmap operations where protected by a
>> global lock.
>>
>> Unfortunately, moving to a lockless implementation means that we suddenly
>> have races on the flag: updates can go missing and we can end up with
>> back-to-back syncs once again.
>>
>> This patch resolves the problem by making the tlb_sync_pending flag an
>> atomic_t and sorts out the ordering with respect to TLB callbacks.
>> Now, the flag is set with release semantics after adding a flush and
>> checked with an xchg operation (and subsequent control dependency) when
>> performing the sync. We could consider using a cmpxchg here, but we'll
>> likely just hit our local update to the flag anyway.
>>
>> Cc: Ray Jui <ray.jui@broadcom.com>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Fixes: 2c3d273eabe8 ("iommu/io-pgtable-arm: Support lockless operation")
>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>> ---
>>  drivers/iommu/io-pgtable.c |  1 +
>>  drivers/iommu/io-pgtable.h | 12 ++++++------
>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
>> index 127558d83667..cd8d7aaec161 100644
>> --- a/drivers/iommu/io-pgtable.c
>> +++ b/drivers/iommu/io-pgtable.c
>> @@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
>>  	iop->cookie	= cookie;
>>  	iop->cfg	= *cfg;
>>  
>> +	atomic_set(&iop->tlb_sync_pending, 0);
>>  	return &iop->ops;
>>  }
>>  
>> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
>> index 524263a7ae6f..b64580c9d03d 100644
>> --- a/drivers/iommu/io-pgtable.h
>> +++ b/drivers/iommu/io-pgtable.h
>> @@ -1,5 +1,7 @@
>>  #ifndef __IO_PGTABLE_H
>>  #define __IO_PGTABLE_H
>> +
>> +#include <linux/atomic.h>
>>  #include <linux/bitops.h>
>>  
>>  /*
>> @@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
>>  struct io_pgtable {
>>  	enum io_pgtable_fmt	fmt;
>>  	void			*cookie;
>> -	bool			tlb_sync_pending;
>> +	atomic_t		tlb_sync_pending;
>>  	struct io_pgtable_cfg	cfg;
>>  	struct io_pgtable_ops	ops;
>>  };
>> @@ -175,22 +177,20 @@ struct io_pgtable {
>>  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
>>  {
>>  	iop->cfg.tlb->tlb_flush_all(iop->cookie);
>> -	iop->tlb_sync_pending = true;
>> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>>  }
>>  
>>  static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
>>  		unsigned long iova, size_t size, size_t granule, bool leaf)
>>  {
>>  	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
>> -	iop->tlb_sync_pending = true;
>> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>>  }
>>  
>>  static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
>>  {
>> -	if (iop->tlb_sync_pending) {
>> +	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
>>  		iop->cfg.tlb->tlb_sync(iop->cookie);
>> -		iop->tlb_sync_pending = false;
>> -	}
>>  }
>>  
>>  /**
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-05  1:45                                     ` Ray Jui
@ 2017-07-05  8:41                                         ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2017-07-05  8:41 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham, iommu, linu.cherian, linux-arm-kernel

On Tue, Jul 04, 2017 at 06:45:17PM -0700, Ray Jui wrote:
> Hi Will/Robin,
> 
> Has anything functionally changed between PATCH v2 and v1? I'm seeing a
> very different L2 throughput with v2 (in general a lot worse with v2 vs.
> v1); however, I'm currently unable to reproduce the TLB sync timed out
> issue with v2 (without the patch from Will's email).
> 
> It could also be something else that has changed in my setup, but so far
> I have not yet been able to spot anything wrong in the setup.

There were fixes, and that initially involved a DSB that was found to be
expensive. The patches queued in -next should have that addressed, so please
use those (or my for-joerg/arm-smmu/updates branch).

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-05  8:41                                         ` Will Deacon
@ 2017-07-05 23:24                                             ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-07-05 23:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: sunil.goutham, iommu, linu.cherian, linux-arm-kernel

Hi Will,

On 7/5/17 1:41 AM, Will Deacon wrote:
> On Tue, Jul 04, 2017 at 06:45:17PM -0700, Ray Jui wrote:
>> Hi Will/Robin,
>>
>> Has anything functionally changed between PATCH v2 and v1? I'm seeing a
>> very different L2 throughput with v2 (in general a lot worse with v2 vs.
>> v1); however, I'm currently unable to reproduce the TLB sync timed out
>> issue with v2 (without the patch from Will's email).
>>
>> It could also be something else that has changed in my setup, but so far
>> I have not yet been able to spot anything wrong in the setup.
> 
> There were fixes, and that initially involved a DSB that was found to be
> expensive. The patches queued in -next should have that addressed, so please
> use those (or my for-joerg/arm-smmu/updates branch).
> 
> Will
> 

That was my bad yesterday. I was in a rush and the setup was incorrect.

I redid my Ethernet performance test with both PATCH v1 and v2 today, and
can confirm the performance is consistent between v1 and v2 as expected.

I also made sure the following message can still be reproduced with
patch set v2:

arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked

Then I proceeded to apply your patch that attempts to fix the deadlock
issue. I also added a print to ensure I'm running the correct build with
your fix patch applied:

diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index cd8d7aa..01a6fa8 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -60,6 +60,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
        iop->cfg        = *cfg;

        atomic_set(&iop->tlb_sync_pending, 0);
+       pr_err("tlb sync pending cleared\n");
        return &iop->ops;
 }

root@bcm958742k:~# dmesg | grep tlb
[    6.495754] tlb sync pending cleared
[    6.509934] tlb sync pending cleared
[    6.510067] tlb sync pending cleared
[    6.510207] tlb sync pending cleared
[    9.864543] tlb sync pending cleared
[    9.874019] tlb sync pending cleared
[    9.979311] tlb sync pending cleared
[   39.616465] tlb sync pending cleared


However, with the fix patch, I can still see the deadlock message when I
have > 32 iperf TX threads active in the system:

root@bcm958742k:~# iperf -c 192.168.1.20 -P64
------------------------------------------------------------
Client connecting to 192.168.1.20, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 66] local 192.168.1.10 port 48802 connected with 192.168.1.20 port 5001
[  6] local 192.168.1.10 port 48680 connected with 192.168.1.20 port 5001
[ 22] local 192.168.1.10 port 48710 connected with 192.168.1.20 port 5001
[ 50] local 192.168.1.10 port 48770 connected with 192.168.1.20 port 5001
[ 32] local 192.168.1.10 port 48734 connected with 192.168.1.20 port 5001
[ 23] local 192.168.1.10 port 48716 connected with 192.168.1.20 port 5001
[ 21] local 192.168.1.10 port 48712 connected with 192.168.1.20 port 5001
[ 10] local 192.168.1.10 port 48688 connected with 192.168.1.20 port 5001
[ 56] local 192.168.1.10 port 48782 connected with 192.168.1.20 port 5001
[ 31] local 192.168.1.10 port 48732 connected with 192.168.1.20 port 5001
[ 63] local 192.168.1.10 port 48796 connected with 192.168.1.20 port 5001
[ 58] local 192.168.1.10 port 48786 connected with 192.168.1.20 port 5001
[ 19] local 192.168.1.10 port 48706 connected with 192.168.1.20 port 5001
[ 47] local 192.168.1.10 port 48764 connected with 192.168.1.20 port 5001
[ 25] local 192.168.1.10 port 48720 connected with 192.168.1.20 port 5001
[ 34] local 192.168.1.10 port 48738 connected with 192.168.1.20 port 5001
[ 64] local 192.168.1.10 port 48798 connected with 192.168.1.20 port 5001
[ 52] local 192.168.1.10 port 48774 connected with 192.168.1.20 port 5001
[ 59] local 192.168.1.10 port 48788 connected with 192.168.1.20 port 5001
[ 30] local 192.168.1.10 port 48730 connected with 192.168.1.20 port 5001
[ 65] local 192.168.1.10 port 48800 connected with 192.168.1.20 port 5001
[ 17] local 192.168.1.10 port 48702 connected with 192.168.1.20 port 5001
[ 20] local 192.168.1.10 port 48708 connected with 192.168.1.20 port 5001
[ 44] local 192.168.1.10 port 48758 connected with 192.168.1.20 port 5001
[ 55] local 192.168.1.10 port 48780 connected with 192.168.1.20 port 5001
[ 33] local 192.168.1.10 port 48736 connected with 192.168.1.20 port 5001
[ 62] local 192.168.1.10 port 48794 connected with 192.168.1.20 port 5001
[ 60] local 192.168.1.10 port 48790 connected with 192.168.1.20 port 5001
[ 14] local 192.168.1.10 port 48696 connected with 192.168.1.20 port 5001
[ 28] local 192.168.1.10 port 48726 connected with 192.168.1.20 port 5001
[ 53] local 192.168.1.10 port 48776 connected with 192.168.1.20 port 5001
[ 42] local 192.168.1.10 port 48754 connected with 192.168.1.20 port 5001
[ 16] local 192.168.1.10 port 48700 connected with 192.168.1.20 port 5001
[  3] local 192.168.1.10 port 48678 connected with 192.168.1.20 port 5001
[ 29] local 192.168.1.10 port 48728 connected with 192.168.1.20 port 5001
[ 27] local 192.168.1.10 port 48724 connected with 192.168.1.20 port 5001
[ 38] local 192.168.1.10 port 48746 connected with 192.168.1.20 port 5001
[ 13] local 192.168.1.10 port 48694 connected with 192.168.1.20 port 5001
[ 12] local 192.168.1.10 port 48692 connected with 192.168.1.20 port 5001
[ 41] local 192.168.1.10 port 48752 connected with 192.168.1.20 port 5001
[ 26] local 192.168.1.10 port 48722 connected with 192.168.1.20 port 5001
[ 11] local 192.168.1.10 port 48690 connected with 192.168.1.20 port 5001
[ 24] local 192.168.1.10 port 48718 connected with 192.168.1.20 port 5001
[ 15] local 192.168.1.10 port 48698 connected with 192.168.1.20 port 5001
[ 37] local 192.168.1.10 port 48744 connected with 192.168.1.20 port 5001
[ 36] local 192.168.1.10 port 48742 connected with 192.168.1.20 port 5001
[ 43] local 192.168.1.10 port 48756 connected with 192.168.1.20 port 5001
[ 48] local 192.168.1.10 port 48766 connected with 192.168.1.20 port 5001
[ 45] local 192.168.1.10 port 48760 connected with 192.168.1.20 port 5001
[ 35] local 192.168.1.10 port 48740 connected with 192.168.1.20 port 5001
[  7] local 192.168.1.10 port 48672 connected with 192.168.1.20 port 5001
[ 39] local 192.168.1.10 port 48748 connected with 192.168.1.20 port 5001
[ 40] local 192.168.1.10 port 48750 connected with 192.168.1.20 port 5001
[  8] local 192.168.1.10 port 48682 connected with 192.168.1.20 port 5001
[ 18] local 192.168.1.10 port 48704 connected with 192.168.1.20 port 5001
[  4] local 192.168.1.10 port 48674 connected with 192.168.1.20 port 5001
[ 46] local 192.168.1.10 port 48762 connected with 192.168.1.20 port 5001
[  5] local 192.168.1.10 port 48676 connected with 192.168.1.20 port 5001
[ 49] local 192.168.1.10 port 48768 connected with 192.168.1.20 port 5001
[ 54] local 192.168.1.10 port 48778 connected with 192.168.1.20 port 5001
[ 57] local 192.168.1.10 port 48784 connected with 192.168.1.20 port 5001
[ 51] local 192.168.1.10 port 48772 connected with 192.168.1.20 port 5001
[  9] local 192.168.1.10 port 48686 connected with 192.168.1.20 port 5001
[ 61] local 192.168.1.10 port 48792 connected with 192.168.1.20 port 5001
[  698.284709] arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
[  699.386010] arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
[  702.064900] arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
[ ID] Interval       Transfer     Bandwidth
[ 26]  0.0-10.0 sec   544 MBytes   456 Mbits/sec
[  6]  0.0-10.0 sec   382 MBytes   320 Mbits/sec
[ 22]  0.0-10.1 sec   667 MBytes   556 Mbits/sec
[ 50]  0.0-10.1 sec   245 MBytes   204 Mbits/sec
[ 21]  0.0-10.1 sec   291 MBytes   242 Mbits/sec
[ 56]  0.0-10.1 sec   256 MBytes   213 Mbits/sec
[ 19]  0.0-10.0 sec  17.0 MBytes  14.2 Mbits/sec
[ 47]  0.0-10.0 sec   357 MBytes   299 Mbits/sec
[ 52]  0.0-10.1 sec   121 MBytes   101 Mbits/sec
[ 59]  0.0-10.0 sec   364 MBytes   304 Mbits/sec
[ 30]  0.0-10.0 sec   469 MBytes   391 Mbits/sec
[ 20]  0.0-10.0 sec   435 MBytes   364 Mbits/sec
[ 44]  0.0-10.0 sec   379 MBytes   317 Mbits/sec
[ 33]  0.0-10.0 sec   468 MBytes   392 Mbits/sec
[ 60]  0.0-10.0 sec   178 MBytes   149 Mbits/sec
[ 14]  0.0-10.1 sec   539 MBytes   449 Mbits/sec
[ 28]  0.0-10.1 sec  60.6 MBytes  50.5 Mbits/sec
[ 42]  0.0-10.1 sec   365 MBytes   304 Mbits/sec
[  3]  0.0-10.1 sec   109 MBytes  90.5 Mbits/sec
[ 29]  0.0-10.1 sec   473 MBytes   395 Mbits/sec
[ 38]  0.0-10.0 sec   254 MBytes   212 Mbits/sec
[ 13]  0.0-10.0 sec   523 MBytes   438 Mbits/sec
[ 12]  0.0-10.1 sec   182 MBytes   152 Mbits/sec
[ 11]  0.0-10.1 sec   130 MBytes   109 Mbits/sec
[ 15]  0.0-10.1 sec   174 MBytes   145 Mbits/sec
[ 43]  0.0-10.1 sec   399 MBytes   333 Mbits/sec
[ 48]  0.0-10.1 sec   543 MBytes   452 Mbits/sec
[ 45]  0.0-10.1 sec  69.1 MBytes  57.6 Mbits/sec
[ 35]  0.0-10.1 sec  54.0 MBytes  45.0 Mbits/sec
[  4]  0.0-10.0 sec   116 MBytes  97.4 Mbits/sec
[ 46]  0.0-10.1 sec   300 MBytes   250 Mbits/sec
[ 51]  0.0-10.1 sec  49.8 MBytes  41.5 Mbits/sec
[ 61]  0.0-10.1 sec   102 MBytes  85.0 Mbits/sec
[ 23]  0.0-10.1 sec  1.64 GBytes  1.39 Gbits/sec
[ 10]  0.0-10.1 sec   210 MBytes   174 Mbits/sec
[ 31]  0.0-10.1 sec  1.16 GBytes   988 Mbits/sec
[ 63]  0.0-10.1 sec   468 MBytes   389 Mbits/sec
[ 25]  0.0-10.1 sec   457 MBytes   381 Mbits/sec
[ 34]  0.0-10.1 sec   332 MBytes   276 Mbits/sec
[ 64]  0.0-10.1 sec   280 MBytes   233 Mbits/sec
[ 17]  0.0-10.1 sec   425 MBytes   354 Mbits/sec
[ 62]  0.0-10.1 sec   616 MBytes   513 Mbits/sec
[ 53]  0.0-10.1 sec   289 MBytes   241 Mbits/sec
[ 16]  0.0-10.1 sec   661 MBytes   550 Mbits/sec
[ 27]  0.0-10.1 sec   298 MBytes   249 Mbits/sec
[ 41]  0.0-10.1 sec  11.5 MBytes  9.57 Mbits/sec
[ 37]  0.0-10.1 sec   945 MBytes   786 Mbits/sec
[ 36]  0.0-10.1 sec   164 MBytes   136 Mbits/sec
[ 40]  0.0-10.1 sec   782 MBytes   650 Mbits/sec
[  8]  0.0-10.1 sec   883 MBytes   734 Mbits/sec
[ 18]  0.0-10.1 sec   140 MBytes   117 Mbits/sec
[  5]  0.0-10.1 sec   366 MBytes   305 Mbits/sec
[ 49]  0.0-10.1 sec   229 MBytes   191 Mbits/sec
[ 54]  0.0-10.1 sec   884 MBytes   736 Mbits/sec
[ 57]  0.0-10.1 sec  56.6 MBytes  47.1 Mbits/sec
[  9]  0.0-10.1 sec  72.8 MBytes  60.4 Mbits/sec
[ 66]  0.0-10.1 sec   170 MBytes   141 Mbits/sec
[ 32]  0.0-10.1 sec   201 MBytes   167 Mbits/sec
[ 58]  0.0-10.1 sec   381 MBytes   317 Mbits/sec
[ 65]  0.0-10.1 sec   373 MBytes   310 Mbits/sec
[ 55]  0.0-10.1 sec  98.0 MBytes  81.5 Mbits/sec
[ 24]  0.0-10.1 sec   292 MBytes   243 Mbits/sec
[  7]  0.0-10.1 sec  1.08 GBytes   918 Mbits/sec
[ 39]  0.0-10.1 sec  95.8 MBytes  79.6 Mbits/sec
[SUM]  0.0-10.1 sec  23.2 GBytes  19.7 Gbits/sec


I played with it a bit and can confirm that if I set all interrupt
affinity to CPU0, I do not see this issue. This tells us that there
still seems to be a race somewhere when multiple CPUs are involved?
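
For completeness: "interrupt affinity set to CPU0" here means writing a
CPU0-only mask to the standard /proc/irq/<n>/smp_affinity files. A
throwaway sketch of that, where IRQ 105 is only a placeholder for the
NIC's actual IRQ numbers:

#include <stdio.h>

int main(void)
{
	/* CPU mask "1" = bit 0 = CPU0 only; writing needs root */
	FILE *f = fopen("/proc/irq/105/smp_affinity", "w");

	if (!f) {
		perror("smp_affinity");
		return 1;
	}
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}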

Regards,

Ray

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-05 23:24                                             ` Ray Jui
@ 2017-07-06 15:08                                                 ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2017-07-06 15:08 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham, iommu, linu.cherian, linux-arm-kernel

Hi Ray,

Thanks for testing this, and sorry it didn't help.

On Wed, Jul 05, 2017 at 04:24:22PM -0700, Ray Jui wrote:
> On 7/5/17 1:41 AM, Will Deacon wrote:
> > On Tue, Jul 04, 2017 at 06:45:17PM -0700, Ray Jui wrote:
> >> Has anything functionally changed between PATCH v2 and v1? I'm seeing a
> >> very different L2 throughput with v2 (in general a lot worse with v2 vs.
> >> v1); however, I'm currently unable to reproduce the TLB sync timed out
> >> issue with v2 (without the patch from Will's email).
> >>
> >> It could also be something else that has changed in my setup, but so far
> >> I have not yet been able to spot anything wrong in the setup.
> > 
> > There were fixes, and that initially involved a DSB that was found to be
> > expensive. The patches queued in -next should have that addressed, so please
> > use those (or my for-joerg/arm-smmu/updates branch).
> > 
> 
> That was my bad yesterday. I was in a rush and the setup was incorrect.
> 
> I redid my Ethernet performance test with both PATCH v1 and v2 today, and
> can confirm the performance is consistent between v1 and v2 as expected.
> 
> I also made sure the following message can still be reproduced with
> patch set v2:
> 
> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
> 
> Then I proceeded to apply your patch that attempts to fix the deadlock
> issue.

[...]

> However, with the fix patch, I can still see the deadlock message when I
> have > 32 iperf TX threads active in the system:

Damn. We've been going over this today and the only plausible theory seems
to be that concurrent TLB syncs are causing the completion to be pushed out,
resulting in timeouts.

Can you try this patch below, instead of the one I sent before, please?

Thanks,

Will

--->8

From bbf3737c29e3d18f539998a66f42878ac91cde97 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Thu, 6 Jul 2017 15:55:48 +0100
Subject: [PATCH] iommu/arm-smmu: Reintroduce locking around TLB sync
 operations

Commit 523d7423e21b ("iommu/arm-smmu: Remove io-pgtable spinlock")
removed the locking used to serialise map/unmap calls into the io-pgtable
code from the ARM SMMU driver. This is good for performance, but opens
us up to a nasty race with TLB syncs because the TLB sync register is
shared within a context bank (or even globally for stage-2 on SMMUv1).

There are two cases to consider:

  1. A CPU can be spinning on the completion of a TLB sync, take an
     interrupt which issues a subsequent TLB sync, and then report a
     timeout on return from the interrupt.

  2. A CPU can be spinning on the completion of a TLB sync, but other
     CPUs can continuously issue additional TLB syncs in such a way that
     the backoff logic reports a timeout.

Rather than fix this by spinning for completion of prior TLB syncs before
issuing a new one (which may suffer from fairness issues on large systems),
instead reintroduce locking around TLB sync operations in the ARM SMMU
driver.

Fixes: 523d7423e21b ("iommu/arm-smmu: Remove io-pgtable spinlock")
Cc: Robin Murphy <robin.murphy@arm.com>
Reported-by: Ray Jui <ray.jui@broadcom.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index b446183b3015..770abd247f40 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -400,6 +400,8 @@ struct arm_smmu_device {
 
 	u32				cavium_id_base; /* Specific to Cavium */
 
+	spinlock_t			global_sync_lock;
+
 	/* IOMMU core code handle */
 	struct iommu_device		iommu;
 };
@@ -436,7 +438,7 @@ struct arm_smmu_domain {
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
 	struct mutex			init_mutex; /* Protects smmu pointer */
-	spinlock_t			cb_lock; /* Serialises ATS1* ops */
+	spinlock_t			cb_lock; /* Serialises ATS1* ops and TLB syncs */
 	struct iommu_domain		domain;
 };
 
@@ -602,9 +604,12 @@ static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
 static void arm_smmu_tlb_sync_global(struct arm_smmu_device *smmu)
 {
 	void __iomem *base = ARM_SMMU_GR0(smmu);
+	unsigned long flags;
 
+	spin_lock_irqsave(&smmu->global_sync_lock, flags);
 	__arm_smmu_tlb_sync(smmu, base + ARM_SMMU_GR0_sTLBGSYNC,
 			    base + ARM_SMMU_GR0_sTLBGSTATUS);
+	spin_unlock_irqrestore(&smmu->global_sync_lock, flags);
 }
 
 static void arm_smmu_tlb_sync_context(void *cookie)
@@ -612,9 +617,12 @@ static void arm_smmu_tlb_sync_context(void *cookie)
 	struct arm_smmu_domain *smmu_domain = cookie;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	void __iomem *base = ARM_SMMU_CB(smmu, smmu_domain->cfg.cbndx);
+	unsigned long flags;
 
+	spin_lock_irqsave(&smmu_domain->cb_lock, flags);
 	__arm_smmu_tlb_sync(smmu, base + ARM_SMMU_CB_TLBSYNC,
 			    base + ARM_SMMU_CB_TLBSTATUS);
+	spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
 }
 
 static void arm_smmu_tlb_sync_vmid(void *cookie)
@@ -1925,6 +1933,7 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
 
 	smmu->num_mapping_groups = size;
 	mutex_init(&smmu->stream_map_mutex);
+	spin_lock_init(&smmu->global_sync_lock);
 
 	if (smmu->version < ARM_SMMU_V2 || !(id & ID0_PTFS_NO_AARCH32)) {
 		smmu->features |= ARM_SMMU_FEAT_FMT_AARCH32_L;
-- 
2.1.4
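
To make the intended semantics concrete: with this patch, at most one
sync can be in flight per context bank (or globally) at any time. A
userspace-flavoured sketch of that shape, with hypothetical names, where
a mutex stands in for the irqsave'd spinlock and a plain flag stands in
for the TLBSYNC/TLBSTATUS register pair:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
static int sync_active;		/* fake "sync in progress" status bit */

static void tlb_sync_context(long cpu)
{
	pthread_mutex_lock(&sync_lock);	/* kernel: spin_lock_irqsave() */
	sync_active = 1;		/* kernel: write to TLBSYNC */
	/*
	 * Poll for completion. Holding the lock means no other CPU can
	 * restart the sync underneath us, so the poll observes *our*
	 * sync finishing instead of being pushed out by later ones
	 * (case 2 above); disabling IRQs rules out case 1.
	 */
	sync_active = 0;		/* "hardware" clears the status bit */
	printf("cpu%ld: sync complete\n", cpu);
	pthread_mutex_unlock(&sync_lock);
}

static void *worker(void *arg)
{
	for (int i = 0; i < 3; i++)
		tlb_sync_context((long)arg);
	return NULL;
}

int main(void)
{
	pthread_t t[2];

	pthread_create(&t[0], NULL, worker, (void *)0L);
	pthread_create(&t[1], NULL, worker, (void *)1L);
	pthread_join(t[0], NULL);
	pthread_join(t[1], NULL);
	return 0;
}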

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-06 15:08                                                 ` Will Deacon
@ 2017-07-06 18:14                                                     ` Ray Jui
  -1 siblings, 0 replies; 50+ messages in thread
From: Ray Jui via iommu @ 2017-07-06 18:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: sunil.goutham, iommu, linu.cherian, linux-arm-kernel

Hi Will,

On 7/6/17 8:08 AM, Will Deacon wrote:
> Hi Ray,
> 
> Thanks for testing this, and sorry it didn't help.
> 
> On Wed, Jul 05, 2017 at 04:24:22PM -0700, Ray Jui wrote:
>> On 7/5/17 1:41 AM, Will Deacon wrote:
>>> On Tue, Jul 04, 2017 at 06:45:17PM -0700, Ray Jui wrote:
>>>> Has anything functionally changed between PATCH v2 and v1? I'm seeing a
>>>> very different L2 throughput with v2 (in general a lot worse with v2 vs.
>>>> v1); however, I'm currently unable to reproduce the TLB sync timed out
>>>> issue with v2 (without the patch from Will's email).
>>>>
>>>> It could also be something else that has changed in my setup, but so far
>>>> I have not yet been able to spot anything wrong in the setup.
>>>
>>> There were fixes, and that initially involved a DSB that was found to be
>>> expensive. The patches queued in -next should have that addressed, so please
>>> use those (or my for-joerg/arm-smmu/updates branch).
>>>
>>
>> That was my bad yesterday. I was in a rush and the setup was incorrect.
>>
>> I redid my Ethernet performance test with both PATCH v1 and v2 today, and
>> can confirm the performance is consistent between v1 and v2 as expected.
>>
>> I also made sure the following message can still be reproduced with
>> patch set v2:
>>
>> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
>>
>> Then I proceeded to apply your patch that attempts to fix the deadlock
>> issue.
> 
> [...]
> 
>> However, with the fix patch, I can still see the deadlock message when I
>> have > 32 iperf TX threads active in the system:
> 
> Damn. We've been going over this today and the only plausible theory seems
> to be that concurrent TLB syncs are causing the completion to be pushed out,
> resulting in timeouts.
> 
> Can you try this patch below, instead of the one I sent before, please?
> 
> Thanks,
> 
> Will
> 
> --->8

Good news! I can confirm that with the new patch below, the "TLB sync
timed out" error is now gone. I ran 20 iterations of the iperf client
test with 64 threads spread across 8 CPUs. The error message used to
come out very quickly with just one iteration, but now it's completely
gone.

At the same time, I do seem to observe a slight impact on performance
when multiple threads are used, but I guess that is expected, given that
a spin lock is added to protect the TLB sync.

Great work and thanks for the fix!

Ray

> 
> From bbf3737c29e3d18f539998a66f42878ac91cde97 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Thu, 6 Jul 2017 15:55:48 +0100
> Subject: [PATCH] iommu/arm-smmu: Reintroduce locking around TLB sync
>  operations
> 
> Commit 523d7423e21b ("iommu/arm-smmu: Remove io-pgtable spinlock")
> removed the locking used to serialise map/unmap calls into the io-pgtable
> code from the ARM SMMU driver. This is good for performance, but opens
> us up to a nasty race with TLB syncs because the TLB sync register is
> shared within a context bank (or even globally for stage-2 on SMMUv1).
> 
> There are two cases to consider:
> 
>   1. A CPU can be spinning on the completion of a TLB sync, take an
>      interrupt which issues a subsequent TLB sync, and then report a
>      timeout on return from the interrupt.
> 
>   2. A CPU can be spinning on the completion of a TLB sync, but other
>      CPUs can continuously issue additional TLB syncs in such a way that
>      the backoff logic reports a timeout.
> 
> Rather than fix this by spinning for completion of prior TLB syncs before
> issuing a new one (which may suffer from fairness issues on large systems),
> instead reintroduce locking around TLB sync operations in the ARM SMMU
> driver.
> 
> Fixes: 523d7423e21b ("iommu/arm-smmu: Remove io-pgtable spinlock")
> Cc: Robin Murphy <robin.murphy@arm.com>
> Reported-by: Ray Jui <ray.jui@broadcom.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  drivers/iommu/arm-smmu.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index b446183b3015..770abd247f40 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -400,6 +400,8 @@ struct arm_smmu_device {
>  
>  	u32				cavium_id_base; /* Specific to Cavium */
>  
> +	spinlock_t			global_sync_lock;
> +
>  	/* IOMMU core code handle */
>  	struct iommu_device		iommu;
>  };
> @@ -436,7 +438,7 @@ struct arm_smmu_domain {
>  	struct arm_smmu_cfg		cfg;
>  	enum arm_smmu_domain_stage	stage;
>  	struct mutex			init_mutex; /* Protects smmu pointer */
> -	spinlock_t			cb_lock; /* Serialises ATS1* ops */
> +	spinlock_t			cb_lock; /* Serialises ATS1* ops and TLB syncs */
>  	struct iommu_domain		domain;
>  };
>  
> @@ -602,9 +604,12 @@ static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
>  static void arm_smmu_tlb_sync_global(struct arm_smmu_device *smmu)
>  {
>  	void __iomem *base = ARM_SMMU_GR0(smmu);
> +	unsigned long flags;
>  
> +	spin_lock_irqsave(&smmu->global_sync_lock, flags);
>  	__arm_smmu_tlb_sync(smmu, base + ARM_SMMU_GR0_sTLBGSYNC,
>  			    base + ARM_SMMU_GR0_sTLBGSTATUS);
> +	spin_unlock_irqrestore(&smmu->global_sync_lock, flags);
>  }
>  
>  static void arm_smmu_tlb_sync_context(void *cookie)
> @@ -612,9 +617,12 @@ static void arm_smmu_tlb_sync_context(void *cookie)
>  	struct arm_smmu_domain *smmu_domain = cookie;
>  	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	void __iomem *base = ARM_SMMU_CB(smmu, smmu_domain->cfg.cbndx);
> +	unsigned long flags;
>  
> +	spin_lock_irqsave(&smmu_domain->cb_lock, flags);
>  	__arm_smmu_tlb_sync(smmu, base + ARM_SMMU_CB_TLBSYNC,
>  			    base + ARM_SMMU_CB_TLBSTATUS);
> +	spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
>  }
>  
>  static void arm_smmu_tlb_sync_vmid(void *cookie)
> @@ -1925,6 +1933,7 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
>  
>  	smmu->num_mapping_groups = size;
>  	mutex_init(&smmu->stream_map_mutex);
> +	spin_lock_init(&smmu->global_sync_lock);
>  
>  	if (smmu->version < ARM_SMMU_V2 || !(id & ID0_PTFS_NO_AARCH32)) {
>  		smmu->features |= ARM_SMMU_FEAT_FMT_AARCH32_L;
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread
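
For context, the "backoff logic" mentioned in the commit message above is,
roughly, the polling loop behind the "TLB sync timed out" message (a sketch
of the 4.13-era code in drivers/iommu/arm-smmu.c; exact constants and
register names may differ):

static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
				void __iomem *sync, void __iomem *status)
{
	unsigned int spin_cnt, delay;

	/* Kick off the sync by writing the sync register... */
	writel_relaxed(0, sync);

	/* ...then spin on the status register with exponential backoff */
	for (delay = 1; delay < TLB_LOOP_TIMEOUT; delay *= 2) {
		for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) {
			if (!(readl_relaxed(status) & sTLBGSTATUS_GSACTIVE))
				return;
			cpu_relax();
		}
		udelay(delay);
	}
	dev_err_ratelimited(smmu->dev,
			    "TLB sync timed out -- SMMU may be deadlocked\n");
}

Both races in the commit message starve this loop: either an interrupt
handler (case 1) or other CPUs (case 2) keep re-arming the shared sync,
so the status register stays active past the timeout even though no sync
is genuinely stuck. Serialising the whole write-then-poll sequence with
cb_lock (or global_sync_lock) guarantees each caller observes the
completion of its own sync.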

* Re: [PATCH 0/8] io-pgtable lock removal
  2017-07-06 18:14                                                     ` Ray Jui
@ 2017-07-07 12:46                                                         ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2017-07-07 12:46 UTC (permalink / raw)
  To: Ray Jui
  Cc: sunil.goutham@cavium.com, iommu@lists.linux-foundation.org,
	linu.cherian@cavium.com, linux-arm-kernel@lists.infradead.org

On Thu, Jul 06, 2017 at 11:14:00AM -0700, Ray Jui wrote:
> On 7/6/17 8:08 AM, Will Deacon wrote:
> > On Wed, Jul 05, 2017 at 04:24:22PM -0700, Ray Jui wrote:
> >> However, with the fix patch, I can still see the deadlock message when I
> >> have > 32 iperf TX threads active in the system:
> > 
> > Damn. We've been going over this today and the only plausible theory seems
> > to be that concurrent TLB syncs are causing the completion to be pushed out,
> > resulting in timeouts.
> > 
> > Can you try this patch below, instead of the one I sent before, please?
> > 
> > Thanks,
> > 
> > Will
> > 
> > --->8
> 
> Good news! I can confirm that with the new patch below, the error log of
> "TLB sync timed out" is now gone. I ran 20 iterations of the iperf
> client test with 64 threads spread across 8 CPUs. The error message used
> to coming out very quickly with just one iteration. But now it's
> completely gone.

Good, I'll queue this as a fix then. Thanks for testing!

> At the same time, I do seem to observe a slight impact on performance
> when multiple threads are used, but I guess that is expected, given that
> a spin lock is added to protect the TLB sync.

Hopefully that performance will pick up again once we defer the sync
to the end of iommu_unmap.

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread
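
To make the deferral Will mentions concrete, here is a minimal sketch,
assuming a hypothetical ->tlb_sync domain callback (names and signatures
are illustrative only, not the eventual mainline interface): unmap the
whole range first, then pay for the sync -- and hence the cb_lock -- only
once per call.

/*
 * Sketch only: assumes a hypothetical ->tlb_sync op alongside the
 * existing ->unmap; the real deferred-sync API may look different.
 */
static size_t unmap_then_sync(struct iommu_domain *domain,
			      unsigned long iova, size_t size)
{
	size_t unmapped = 0, len;

	while (unmapped < size) {
		/* ->unmap() queues TLB invalidations but does not sync */
		len = domain->ops->unmap(domain, iova + unmapped,
					 size - unmapped);
		if (!len)
			break;
		unmapped += len;
	}

	/* One sync (one cb_lock acquisition) covers the whole range */
	domain->ops->tlb_sync(domain);

	return unmapped;
}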

end of thread, other threads:[~2017-07-07 12:46 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-08 11:51 [PATCH 0/8] io-pgtable lock removal Robin Murphy
2017-06-08 11:51 ` Robin Murphy
2017-06-08 11:52 ` [PATCH 1/8] iommu/io-pgtable-arm-v7s: Check table PTEs more precisely Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 2/8] iommu/io-pgtable-arm: Improve split_blk_unmap Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 3/8] iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 4/8] iommu/io-pgtable: Introduce explicit coherency Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 5/8] iommu/io-pgtable-arm: Support lockless operation Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 6/8] iommu/io-pgtable-arm-v7s: " Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 7/8] iommu/arm-smmu: Remove io-pgtable spinlock Robin Murphy
2017-06-08 11:52   ` Robin Murphy
2017-06-08 11:52 ` [PATCH 8/8] iommu/arm-smmu-v3: " Robin Murphy
2017-06-08 11:52   ` Robin Murphy
     [not found] ` <cover.1496921366.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2017-06-09 19:28   ` [PATCH 0/8] io-pgtable lock removal Nate Watterson
2017-06-09 19:28     ` Nate Watterson
     [not found]     ` <458ad41d-6679-eeca-3c0f-13ccb6c933b6-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
2017-06-15  0:40       ` Ray Jui via iommu
2017-06-15  0:40         ` Ray Jui
     [not found]         ` <b7830be1-9a78-e29f-a29c-4798aaa28c0a-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-06-15 12:25           ` John Garry
2017-06-15 12:25             ` John Garry
2017-06-20 13:37           ` Robin Murphy
2017-06-20 13:37             ` Robin Murphy
     [not found]             ` <cdc1799b-f142-09ed-a7e5-d7fd2e70268f-5wv7dgnIgG8@public.gmane.org>
2017-06-27 16:43               ` Ray Jui via iommu
2017-06-27 16:43                 ` Ray Jui
     [not found]                 ` <e43ba1fe-696e-fabb-a800-52fadaa2fa93-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-06-28 11:46                   ` Will Deacon
2017-06-28 11:46                     ` Will Deacon
     [not found]                     ` <20170628114609.GD11053-5wv7dgnIgG8@public.gmane.org>
2017-06-28 17:02                       ` Ray Jui via iommu
2017-06-28 17:02                         ` Ray Jui
     [not found]                         ` <87d53115-3d80-5a3d-6632-c31986cb7018-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-07-04 17:31                           ` Will Deacon
2017-07-04 17:31                             ` Will Deacon
     [not found]                             ` <20170704173155.GN22175-5wv7dgnIgG8@public.gmane.org>
2017-07-04 17:39                               ` Ray Jui via iommu
2017-07-04 17:39                                 ` Ray Jui
     [not found]                                 ` <6814b246-22f0-bfaa-5002-a269b2735116-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-07-05  1:45                                   ` Ray Jui via iommu
2017-07-05  1:45                                     ` Ray Jui
     [not found]                                     ` <2d5f5ef3-32b1-76c6-6869-ff980557f8e8-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-07-05  8:41                                       ` Will Deacon
2017-07-05  8:41                                         ` Will Deacon
     [not found]                                         ` <20170705084143.GA9378-5wv7dgnIgG8@public.gmane.org>
2017-07-05 23:24                                           ` Ray Jui via iommu
2017-07-05 23:24                                             ` Ray Jui
     [not found]                                             ` <5149280b-a214-249c-c5e2-3712b1f941d2-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-07-06 15:08                                               ` Will Deacon
2017-07-06 15:08                                                 ` Will Deacon
     [not found]                                                 ` <20170706150838.GB15574-5wv7dgnIgG8@public.gmane.org>
2017-07-06 18:14                                                   ` Ray Jui via iommu
2017-07-06 18:14                                                     ` Ray Jui
     [not found]                                                     ` <94ba5d4a-0dae-9394-79ef-90da86e49c86-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
2017-07-07 12:46                                                       ` Will Deacon
2017-07-07 12:46                                                         ` Will Deacon
2017-06-21 15:47           ` Joerg Roedel
2017-06-21 15:47             ` Joerg Roedel
