* [PATCH v2 0/8] io-pgtable lock removal
@ 2017-06-22 15:53 ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

The feedback has been promising, so v2 is just a final update to cover
a handful of memory ordering and cosmetic tweaks that came up when Will
and I went through this offline.

Thanks,
Robin.


Robin Murphy (8):
  iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
  iommu/io-pgtable-arm: Improve split_blk_unmap
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/arm-smmu-v3: Remove io-pgtable spinlock

 drivers/iommu/arm-smmu-v3.c        |  36 ++-----
 drivers/iommu/arm-smmu.c           |  48 ++++------
 drivers/iommu/io-pgtable-arm-v7s.c | 179 +++++++++++++++++++++-------------
 drivers/iommu/io-pgtable-arm.c     | 191 ++++++++++++++++++++++++-------------
 drivers/iommu/io-pgtable.h         |   6 ++
 5 files changed, 274 insertions(+), 186 deletions(-)

-- 
2.12.2.dirty

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/8] iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
@ 2017-06-22 15:53     ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

Whilst we don't support the PXN bit at all, and so should never encounter
a level 1 section or supersection PTE with it set, it would still be wise
to check both table type bits to resolve any theoretical ambiguity.
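
To make the ambiguity concrete (hypothetical PTE value, purely for
illustration): with the PXN extension, bit 0 of a level 1 section or
supersection descriptor is the PXN bit, so such an entry with PXN set has
type bits[1:0] == 0b11, and a check of bit 0 alone would mistake it for a
table pointer:

    arm_v7s_iopte pte = 0x40000003;    /* made-up level 1 section with PXN set */
    bool old_check = pte & ARM_V7S_PTE_TYPE_TABLE;           /* true: misread as a table   */
    bool new_check = (pte & 0x3) == ARM_V7S_PTE_TYPE_TABLE;  /* false: correctly a section */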

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/io-pgtable-arm-v7s.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 8d6ca28c3e1f..a490db032c51 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -92,7 +92,8 @@
 #define ARM_V7S_PTE_TYPE_CONT_PAGE	0x1
 
 #define ARM_V7S_PTE_IS_VALID(pte)	(((pte) & 0x3) != 0)
-#define ARM_V7S_PTE_IS_TABLE(pte, lvl)	(lvl == 1 && ((pte) & ARM_V7S_PTE_TYPE_TABLE))
+#define ARM_V7S_PTE_IS_TABLE(pte, lvl) \
+	((lvl) == 1 && (((pte) & 0x3) == ARM_V7S_PTE_TYPE_TABLE))
 
 /* Page table bits */
 #define ARM_V7S_ATTR_XN(lvl)		BIT(4 * (2 - (lvl)))
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 2/8] iommu/io-pgtable-arm: Improve split_blk_unmap
@ 2017-06-22 15:53     ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

The current split_blk_unmap implementation suffers from some inscrutable
pointer trickery for creating the tables to replace the block entry, but
more than that it also suffers from hideous inefficiency. For example,
the most pathological case of unmapping a level 3 page from a level 1
block will allocate 513 lower-level tables to remap the entire block at
page granularity, when only 2 are actually needed (the rest can be
covered by level 2 block entries).
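
(To put numbers on that worst case, assuming the common 4KB granule: a
level 1 block spans 1GB, i.e. 512 level 2 entries of 2MB, each of which in
turn spans 512 level 3 entries of 4KB. Remapping the whole block at page
granularity therefore costs one level 2 table plus 512 level 3 tables,
whereas only the level 2 table and the single level 3 table covering the
2MB region around the unmapped page are strictly needed.)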

Also, we would like to be able to relax the spinlock requirement in
future, for which the roll-back-and-try-again logic for race resolution
would be pretty hideous under the current paradigm.

Both issues can be resolved most neatly by turning things sideways:
instead of repeatedly recursing into __arm_lpae_map() to build up an
entire new sub-table depth-first, we can directly replace the block
entry with a next-level table of block/page entries, then repeat by
unmapping at the next level if necessary. With a little refactoring of
some helper functions, the code ends up not much bigger than before, but
considerably easier to follow and to adapt in future.
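
In code terms, the new flow is roughly the following (condensed from the
diff below, with the allocation-failure and level-limit checks omitted):

    tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
    for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
        if (i != unmap_idx)    /* leave a hole where the unmap lands exactly */
            __arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
    arm_lpae_install_table(tablep, ptep, cfg);
    if (unmap_idx < 0)         /* unmapping something smaller than one entry */
        return __arm_lpae_unmap(data, iova, size, lvl, tablep);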

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: Make the address increment in the loop nicer, which also allows the
    factoring of the init_pte helpers to be a little cleaner too

 drivers/iommu/io-pgtable-arm.c | 118 ++++++++++++++++++++++++-----------------
 1 file changed, 69 insertions(+), 49 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6e5df5e0a3bd..dd7477010291 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -264,19 +264,38 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			    unsigned long iova, size_t size, int lvl,
 			    arm_lpae_iopte *ptep);
 
+static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+				phys_addr_t paddr, arm_lpae_iopte prot,
+				int lvl, arm_lpae_iopte *ptep)
+{
+	arm_lpae_iopte pte = prot;
+
+	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		pte |= ARM_LPAE_PTE_NS;
+
+	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		pte |= ARM_LPAE_PTE_TYPE_PAGE;
+	else
+		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
+	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
+
+	__arm_lpae_set_pte(ptep, pte, &data->iop.cfg);
+}
+
 static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			     unsigned long iova, phys_addr_t paddr,
 			     arm_lpae_iopte prot, int lvl,
 			     arm_lpae_iopte *ptep)
 {
-	arm_lpae_iopte pte = prot;
-	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte pte = *ptep;
 
-	if (iopte_leaf(*ptep, lvl)) {
+	if (iopte_leaf(pte, lvl)) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
-	} else if (iopte_type(*ptep, lvl) == ARM_LPAE_PTE_TYPE_TABLE) {
+	} else if (iopte_type(pte, lvl) == ARM_LPAE_PTE_TYPE_TABLE) {
 		/*
 		 * We need to unmap and free the old table before
 		 * overwriting it with a block entry.
@@ -289,21 +308,24 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			return -EINVAL;
 	}
 
-	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-		pte |= ARM_LPAE_PTE_NS;
-
-	if (lvl == ARM_LPAE_MAX_LEVELS - 1)
-		pte |= ARM_LPAE_PTE_TYPE_PAGE;
-	else
-		pte |= ARM_LPAE_PTE_TYPE_BLOCK;
-
-	pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS;
-	pte |= pfn_to_iopte(paddr >> data->pg_shift, data);
-
-	__arm_lpae_set_pte(ptep, pte, cfg);
+	__arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
 	return 0;
 }
 
+static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
+					     arm_lpae_iopte *ptep,
+					     struct io_pgtable_cfg *cfg)
+{
+	arm_lpae_iopte new;
+
+	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		new |= ARM_LPAE_PTE_NSTABLE;
+
+	__arm_lpae_set_pte(ptep, new, cfg);
+	return new;
+}
+
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 			  phys_addr_t paddr, size_t size, arm_lpae_iopte prot,
 			  int lvl, arm_lpae_iopte *ptep)
@@ -331,10 +353,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		if (!cptep)
 			return -ENOMEM;
 
-		pte = __pa(cptep) | ARM_LPAE_PTE_TYPE_TABLE;
-		if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-			pte |= ARM_LPAE_PTE_NSTABLE;
-		__arm_lpae_set_pte(ptep, pte, cfg);
+		arm_lpae_install_table(cptep, ptep, cfg);
 	} else if (!iopte_leaf(pte, lvl)) {
 		cptep = iopte_deref(pte, data);
 	} else {
@@ -452,40 +471,43 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
 
 static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 				    unsigned long iova, size_t size,
-				    arm_lpae_iopte prot, int lvl,
-				    arm_lpae_iopte *ptep, size_t blk_size)
+				    arm_lpae_iopte blk_pte, int lvl,
+				    arm_lpae_iopte *ptep)
 {
-	unsigned long blk_start, blk_end;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte pte, *tablep;
 	phys_addr_t blk_paddr;
-	arm_lpae_iopte table = 0;
+	size_t tablesz = ARM_LPAE_GRANULE(data);
+	size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	int i, unmap_idx = -1;
 
-	blk_start = iova & ~(blk_size - 1);
-	blk_end = blk_start + blk_size;
-	blk_paddr = iopte_to_pfn(*ptep, data) << data->pg_shift;
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return 0;
 
-	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
-		arm_lpae_iopte *tablep;
+	tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+	if (!tablep)
+		return 0; /* Bytes unmapped */
 
+	if (size == split_sz)
+		unmap_idx = ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+	blk_paddr = iopte_to_pfn(blk_pte, data) << data->pg_shift;
+	pte = iopte_prot(blk_pte);
+
+	for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) {
 		/* Unmap! */
-		if (blk_start == iova)
+		if (i == unmap_idx)
 			continue;
 
-		/* __arm_lpae_map expects a pointer to the start of the table */
-		tablep = &table - ARM_LPAE_LVL_IDX(blk_start, lvl, data);
-		if (__arm_lpae_map(data, blk_start, blk_paddr, size, prot, lvl,
-				   tablep) < 0) {
-			if (table) {
-				/* Free the table we allocated */
-				tablep = iopte_deref(table, data);
-				__arm_lpae_free_pgtable(data, lvl + 1, tablep);
-			}
-			return 0; /* Bytes unmapped */
-		}
+		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
 	}
 
-	__arm_lpae_set_pte(ptep, table, &data->iop.cfg);
-	iova &= ~(blk_size - 1);
-	io_pgtable_tlb_add_flush(&data->iop, iova, blk_size, blk_size, true);
+	arm_lpae_install_table(tablep, ptep, cfg);
+
+	if (unmap_idx < 0)
+		return __arm_lpae_unmap(data, iova, size, lvl, tablep);
+
+	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
 }
 
@@ -495,7 +517,6 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 {
 	arm_lpae_iopte pte;
 	struct io_pgtable *iop = &data->iop;
-	size_t blk_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
 	/* Something went horribly wrong and we ran out of page table */
 	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -507,7 +528,7 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		return 0;
 
 	/* If the size matches this level, we're in the right place */
-	if (size == blk_size) {
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
 		__arm_lpae_set_pte(ptep, 0, &iop->cfg);
 
 		if (!iopte_leaf(pte, lvl)) {
@@ -527,9 +548,8 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		 * Insert a table at the next level to map the old region,
 		 * minus the part we want to unmap
 		 */
-		return arm_lpae_split_blk_unmap(data, iova, size,
-						iopte_prot(pte), lvl, ptep,
-						blk_size);
+		return arm_lpae_split_blk_unmap(data, iova, size, pte,
+						lvl + 1, ptep);
 	}
 
 	/* Keep on walkin' */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 3/8] iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
@ 2017-06-22 15:53     ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

Whilst the short-descriptor format's split_blk_unmap implementation has
no need to be recursive, it followed the pattern of the LPAE version
anyway for the sake of consistency. With the latter now reworked for
both efficiency and future scalability improvements, tweak the former
similarly, not least to make it less obtuse.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: One trivial cosmetic tweak to the loop increment

 drivers/iommu/io-pgtable-arm-v7s.c | 85 ++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index a490db032c51..692621f1022a 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -281,6 +281,13 @@ static arm_v7s_iopte arm_v7s_prot_to_pte(int prot, int lvl,
 	else if (prot & IOMMU_CACHE)
 		pte |= ARM_V7S_ATTR_B | ARM_V7S_ATTR_C;
 
+	pte |= ARM_V7S_PTE_TYPE_PAGE;
+	if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
+		pte |= ARM_V7S_ATTR_NS_SECTION;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
+		pte |= ARM_V7S_ATTR_MTK_4GB;
+
 	return pte;
 }
 
@@ -353,7 +360,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 			    int lvl, int num_entries, arm_v7s_iopte *ptep)
 {
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
-	arm_v7s_iopte pte = arm_v7s_prot_to_pte(prot, lvl, cfg);
+	arm_v7s_iopte pte;
 	int i;
 
 	for (i = 0; i < num_entries; i++)
@@ -375,13 +382,7 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 			return -EEXIST;
 		}
 
-	pte |= ARM_V7S_PTE_TYPE_PAGE;
-	if (lvl == 1 && (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS))
-		pte |= ARM_V7S_ATTR_NS_SECTION;
-
-	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB)
-		pte |= ARM_V7S_ATTR_MTK_4GB;
-
+	pte = arm_v7s_prot_to_pte(prot, lvl, cfg);
 	if (num_entries > 1)
 		pte = arm_v7s_pte_to_cont(pte, lvl);
 
@@ -391,6 +392,20 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 	return 0;
 }
 
+static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte *table,
+					   arm_v7s_iopte *ptep,
+					   struct io_pgtable_cfg *cfg)
+{
+	arm_v7s_iopte new;
+
+	new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
+		new |= ARM_V7S_ATTR_NS_TABLE;
+
+	__arm_v7s_set_pte(ptep, new, 1, cfg);
+	return new;
+}
+
 static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 			 phys_addr_t paddr, size_t size, int prot,
 			 int lvl, arm_v7s_iopte *ptep)
@@ -418,11 +433,7 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 		if (!cptep)
 			return -ENOMEM;
 
-		pte = virt_to_phys(cptep) | ARM_V7S_PTE_TYPE_TABLE;
-		if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
-			pte |= ARM_V7S_ATTR_NS_TABLE;
-
-		__arm_v7s_set_pte(ptep, pte, 1, cfg);
+		arm_v7s_install_table(cptep, ptep, cfg);
 	} else if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
 		cptep = iopte_deref(pte, lvl);
 	} else {
@@ -503,41 +514,35 @@ static void arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
 
 static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
 				   unsigned long iova, size_t size,
-				   arm_v7s_iopte *ptep)
+				   arm_v7s_iopte blk_pte, arm_v7s_iopte *ptep)
 {
-	unsigned long blk_start, blk_end, blk_size;
-	phys_addr_t blk_paddr;
-	arm_v7s_iopte table = 0;
-	int prot = arm_v7s_pte_to_prot(*ptep, 1);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_v7s_iopte pte, *tablep;
+	int i, unmap_idx, num_entries, num_ptes;
 
-	blk_size = ARM_V7S_BLOCK_SIZE(1);
-	blk_start = iova & ARM_V7S_LVL_MASK(1);
-	blk_end = blk_start + ARM_V7S_BLOCK_SIZE(1);
-	blk_paddr = *ptep & ARM_V7S_LVL_MASK(1);
+	tablep = __arm_v7s_alloc_table(2, GFP_ATOMIC, data);
+	if (!tablep)
+		return 0; /* Bytes unmapped */
 
-	for (; blk_start < blk_end; blk_start += size, blk_paddr += size) {
-		arm_v7s_iopte *tablep;
+	num_ptes = ARM_V7S_PTES_PER_LVL(2);
+	num_entries = size >> ARM_V7S_LVL_SHIFT(2);
+	unmap_idx = ARM_V7S_LVL_IDX(iova, 2);
 
+	pte = arm_v7s_prot_to_pte(arm_v7s_pte_to_prot(blk_pte, 1), 2, cfg);
+	if (num_entries > 1)
+		pte = arm_v7s_pte_to_cont(pte, 2);
+
+	for (i = 0; i < num_ptes; i += num_entries, pte += size) {
 		/* Unmap! */
-		if (blk_start == iova)
+		if (i == unmap_idx)
 			continue;
 
-		/* __arm_v7s_map expects a pointer to the start of the table */
-		tablep = &table - ARM_V7S_LVL_IDX(blk_start, 1);
-		if (__arm_v7s_map(data, blk_start, blk_paddr, size, prot, 1,
-				  tablep) < 0) {
-			if (table) {
-				/* Free the table we allocated */
-				tablep = iopte_deref(table, 1);
-				__arm_v7s_free_table(tablep, 2, data);
-			}
-			return 0; /* Bytes unmapped */
-		}
+		__arm_v7s_set_pte(&tablep[i], pte, num_entries, cfg);
 	}
 
-	__arm_v7s_set_pte(ptep, table, 1, &data->iop.cfg);
-	iova &= ~(blk_size - 1);
-	io_pgtable_tlb_add_flush(&data->iop, iova, blk_size, blk_size, true);
+	arm_v7s_install_table(tablep, ptep, cfg);
+
+	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
 }
 
@@ -594,7 +599,7 @@ static int __arm_v7s_unmap(struct arm_v7s_io_pgtable *data,
 		 * Insert a table at the next level to map the old region,
 		 * minus the part we want to unmap
 		 */
-		return arm_v7s_split_blk_unmap(data, iova, size, ptep);
+		return arm_v7s_split_blk_unmap(data, iova, size, pte[0], ptep);
 	}
 
 	/* Keep on walkin' */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 4/8] iommu/io-pgtable: Introduce explicit coherency
@ 2017-06-22 15:53     ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

Once we remove the serialising spinlock, a potential race opens up for
non-coherent IOMMUs whereby a caller of .map() can be sure that cache
maintenance has been performed on their new PTE, but will have no
guarantee that such maintenance for table entries above it has actually
completed (e.g. if another CPU took an interrupt immediately after
writing the table entry, but before initiating the DMA sync).
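
A concrete interleaving of that race on a non-coherent IOMMU, with two
CPUs mapping under the same newly created table:

    CPU0: writes the parent PTE pointing at the new table
    CPU0: <interrupted before the DMA sync of that parent PTE>
    CPU1: map() finds the parent PTE already present, so writes and syncs
          only its own leaf PTE, then returns success
          -> DMA may now begin while the table walker can still fetch the
             stale, unsynced parent entry from memory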

Handling this race safely will add some potentially non-trivial overhead
to installing a table entry, which we would much rather avoid on
coherent systems where it will be unnecessary, and where we are striving
to minimise latency by removing the locking in the first place.

To that end, let's introduce an explicit notion of cache-coherency to
io-pgtable, such that we will be able to avoid penalising IOMMUs which
know enough to know when they are coherent.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/arm-smmu-v3.c        |  3 +++
 drivers/iommu/arm-smmu.c           |  3 +++
 drivers/iommu/io-pgtable-arm-v7s.c | 17 ++++++++++-------
 drivers/iommu/io-pgtable-arm.c     | 11 ++++++-----
 drivers/iommu/io-pgtable.h         |  6 ++++++
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969aa60d5..b9c4cf4ccca2 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1555,6 +1555,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
+		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
+
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops)
 		return -ENOMEM;
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7ec30b08b3bd..1f42f339a284 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1010,6 +1010,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
+		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
+
 	smmu_domain->smmu = smmu;
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops) {
diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 692621f1022a..1e283be39668 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -187,7 +187,8 @@ static arm_v7s_iopte *iopte_deref(arm_v7s_iopte pte, int lvl)
 static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 				   struct arm_v7s_io_pgtable *data)
 {
-	struct device *dev = data->iop.cfg.iommu_dev;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	struct device *dev = cfg->iommu_dev;
 	dma_addr_t dma;
 	size_t size = ARM_V7S_TABLE_SIZE(lvl);
 	void *table = NULL;
@@ -196,7 +197,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 		table = (void *)__get_dma_pages(__GFP_ZERO, get_order(size));
 	else if (lvl == 2)
 		table = kmem_cache_zalloc(data->l2_tables, gfp | GFP_DMA);
-	if (table && !selftest_running) {
+	if (table && !(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)) {
 		dma = dma_map_single(dev, table, size, DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, dma))
 			goto out_free;
@@ -225,10 +226,11 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
 static void __arm_v7s_free_table(void *table, int lvl,
 				 struct arm_v7s_io_pgtable *data)
 {
-	struct device *dev = data->iop.cfg.iommu_dev;
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	struct device *dev = cfg->iommu_dev;
 	size_t size = ARM_V7S_TABLE_SIZE(lvl);
 
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_unmap_single(dev, __arm_v7s_dma_addr(table), size,
 				 DMA_TO_DEVICE);
 	if (lvl == 1)
@@ -240,7 +242,7 @@ static void __arm_v7s_free_table(void *table, int lvl,
 static void __arm_v7s_pte_sync(arm_v7s_iopte *ptep, int num_entries,
 			       struct io_pgtable_cfg *cfg)
 {
-	if (selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		return;
 
 	dma_sync_single_for_device(cfg->iommu_dev, __arm_v7s_dma_addr(ptep),
@@ -657,7 +659,8 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,
 	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
 			    IO_PGTABLE_QUIRK_NO_PERMS |
 			    IO_PGTABLE_QUIRK_TLBI_ON_MAP |
-			    IO_PGTABLE_QUIRK_ARM_MTK_4GB))
+			    IO_PGTABLE_QUIRK_ARM_MTK_4GB |
+			    IO_PGTABLE_QUIRK_NO_DMA))
 		return NULL;
 
 	/* If ARM_MTK_4GB is enabled, the NO_PERMS is also expected. */
@@ -774,7 +777,7 @@ static int __init arm_v7s_do_selftests(void)
 		.tlb = &dummy_tlb_ops,
 		.oas = 32,
 		.ias = 32,
-		.quirks = IO_PGTABLE_QUIRK_ARM_NS,
+		.quirks = IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA,
 		.pgsize_bitmap = SZ_4K | SZ_64K | SZ_1M | SZ_16M,
 	};
 	unsigned int iova, size, iova_start;
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index dd7477010291..6334f51912ea 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -217,7 +217,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
 	if (!pages)
 		return NULL;
 
-	if (!selftest_running) {
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)) {
 		dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, dma))
 			goto out_free;
@@ -243,7 +243,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
 static void __arm_lpae_free_pages(void *pages, size_t size,
 				  struct io_pgtable_cfg *cfg)
 {
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
 				 size, DMA_TO_DEVICE);
 	free_pages_exact(pages, size);
@@ -254,7 +254,7 @@ static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
 {
 	*ptep = pte;
 
-	if (!selftest_running)
+	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
 		dma_sync_single_for_device(cfg->iommu_dev,
 					   __arm_lpae_dma_addr(ptep),
 					   sizeof(pte), DMA_TO_DEVICE);
@@ -693,7 +693,7 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 	u64 reg;
 	struct arm_lpae_io_pgtable *data;
 
-	if (cfg->quirks & ~IO_PGTABLE_QUIRK_ARM_NS)
+	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
@@ -782,7 +782,7 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
 	struct arm_lpae_io_pgtable *data;
 
 	/* The NS quirk doesn't apply at stage 2 */
-	if (cfg->quirks)
+	if (cfg->quirks & ~IO_PGTABLE_QUIRK_NO_DMA)
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
@@ -1086,6 +1086,7 @@ static int __init arm_lpae_do_selftests(void)
 	struct io_pgtable_cfg cfg = {
 		.tlb = &dummy_tlb_ops,
 		.oas = 48,
+		.quirks = IO_PGTABLE_QUIRK_NO_DMA,
 	};
 
 	for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 969d82cc92ca..524263a7ae6f 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -65,11 +65,17 @@ struct io_pgtable_cfg {
 	 *	PTEs, for Mediatek IOMMUs which treat it as a 33rd address bit
 	 *	when the SoC is in "4GB mode" and they can only access the high
 	 *	remap of DRAM (0x1_00000000 to 0x1_ffffffff).
+	 *
+	 * IO_PGTABLE_QUIRK_NO_DMA: Guarantees that the tables will only ever
+	 *	be accessed by a fully cache-coherent IOMMU or CPU (e.g. for a
+	 *	software-emulated IOMMU), such that pagetable updates need not
+	 *	be treated as explicit DMA data.
 	 */
 	#define IO_PGTABLE_QUIRK_ARM_NS		BIT(0)
 	#define IO_PGTABLE_QUIRK_NO_PERMS	BIT(1)
 	#define IO_PGTABLE_QUIRK_TLBI_ON_MAP	BIT(2)
 	#define IO_PGTABLE_QUIRK_ARM_MTK_4GB	BIT(3)
+	#define IO_PGTABLE_QUIRK_NO_DMA		BIT(4)
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
@ 2017-06-22 15:53     ` Robin Murphy
  0 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

For parallel I/O with multiple concurrent threads servicing the same
device (or devices, if several share a domain), serialising page table
updates becomes a massive bottleneck. On reflection, though, we don't
strictly need to do that - for valid IOMMU API usage, there are in fact
only two races that we need to guard against: multiple map requests for
different blocks within the same region, when the intermediate-level
table for that region does not yet exist; and multiple unmaps of
different parts of the same block entry. Both of those are fairly easily
solved by using a cmpxchg to install the new table, such that if we then
find that someone else's table got there first, we can simply free ours
and continue.
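
In essence, the map path's handling of that race boils down to the
following (condensed from the diff below; allocation-failure handling and
the non-coherent PTE-sync dance are omitted):

    pte = READ_ONCE(*ptep);
    if (!pte) {
        cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
        /* a non-zero return means somebody else's table got there first */
        pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
        if (pte)
            __arm_lpae_free_pages(cptep, tblsz, cfg);
    }
    if (pte && !iopte_leaf(pte, lvl))
        cptep = iopte_deref(pte, data);    /* walk whichever table won */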

Make the requisite changes such that we can withstand being called
without the caller maintaining a lock. In theory, this opens up a few
corners in which wildly misbehaving callers making nonsensical
overlapping requests might lead to crashes instead of just unpredictable
results, but correct code really does not deserve to pay a significant
performance cost for the sake of masking bugs in theoretical broken code.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2:
 - Fix barriers in install_table
 - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
 - Minor cosmetics

 drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
 1 file changed, 57 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6334f51912ea..52700fa958c2 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -20,6 +20,7 @@
 
 #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
 
+#include <linux/atomic.h>
 #include <linux/iommu.h>
 #include <linux/kernel.h>
 #include <linux/sizes.h>
@@ -99,6 +100,8 @@
 #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
+/* Software bit for solving coherency races */
+#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
 
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
@@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
 	free_pages_exact(pages, size);
 }
 
+static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
+				struct io_pgtable_cfg *cfg)
+{
+	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
+				   sizeof(*ptep), DMA_TO_DEVICE);
+}
+
 static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
 			       struct io_pgtable_cfg *cfg)
 {
 	*ptep = pte;
 
 	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
-		dma_sync_single_for_device(cfg->iommu_dev,
-					   __arm_lpae_dma_addr(ptep),
-					   sizeof(pte), DMA_TO_DEVICE);
+		__arm_lpae_sync_pte(ptep, cfg);
 }
 
 static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 
 static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 					     arm_lpae_iopte *ptep,
+					     arm_lpae_iopte curr,
 					     struct io_pgtable_cfg *cfg)
 {
-	arm_lpae_iopte new;
+	arm_lpae_iopte old, new;
 
 	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
 		new |= ARM_LPAE_PTE_NSTABLE;
 
-	__arm_lpae_set_pte(ptep, new, cfg);
-	return new;
+	/* Ensure the table itself is visible before its PTE can be */
+	wmb();
+
+	old = cmpxchg64_relaxed(ptep, curr, new);
+
+	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
+	    (old & ARM_LPAE_PTE_SW_SYNC))
+		return old;
+
+	/* Even if it's not ours, there's no point waiting; just kick it */
+	__arm_lpae_sync_pte(ptep, cfg);
+	if (old == curr)
+		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
+
+	return old;
 }
 
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
@@ -332,6 +354,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 {
 	arm_lpae_iopte *cptep, pte;
 	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	size_t tblsz = ARM_LPAE_GRANULE(data);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 
 	/* Find our entry at the current level */
@@ -346,17 +369,23 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		return -EINVAL;
 
 	/* Grab a pointer to the next level */
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (!pte) {
-		cptep = __arm_lpae_alloc_pages(ARM_LPAE_GRANULE(data),
-					       GFP_ATOMIC, cfg);
+		cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
 		if (!cptep)
 			return -ENOMEM;
 
-		arm_lpae_install_table(cptep, ptep, cfg);
-	} else if (!iopte_leaf(pte, lvl)) {
+		pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
+		if (pte)
+			__arm_lpae_free_pages(cptep, tblsz, cfg);
+	} else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
+		   !(pte & ARM_LPAE_PTE_SW_SYNC)) {
+		__arm_lpae_sync_pte(ptep, cfg);
+	}
+
+	if (pte && !iopte_leaf(pte, lvl)) {
 		cptep = iopte_deref(pte, data);
-	} else {
+	} else if (pte) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
@@ -502,7 +531,19 @@ static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
 	}
 
-	arm_lpae_install_table(tablep, ptep, cfg);
+	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+	if (pte != blk_pte) {
+		__arm_lpae_free_pages(tablep, tablesz, cfg);
+		/*
+		 * We may race against someone unmapping another part of this
+		 * block, but anything else is invalid. We can't misinterpret
+		 * a page entry here since we're never at the last level.
+		 */
+		if (iopte_type(pte, lvl - 1) != ARM_LPAE_PTE_TYPE_TABLE)
+			return 0;
+
+		tablep = iopte_deref(pte, data);
+	}
 
 	if (unmap_idx < 0)
 		return __arm_lpae_unmap(data, iova, size, lvl, tablep);
@@ -523,7 +564,7 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		return 0;
 
 	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (WARN_ON(!pte))
 		return 0;
 
@@ -585,7 +626,8 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
 			return 0;
 
 		/* Grab the IOPTE we're interested in */
-		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
+		ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+		pte = READ_ONCE(*ptep);
 
 		/* Valid entry? */
 		if (!pte)
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread
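
The diff above rests on a single idiom: speculatively allocate the next-level
table, publish it with a compare-and-swap on the parent PTE, and free it again
if somebody else's table won the race. Below is a minimal stand-alone sketch of
that idiom in C11 atomics; it is illustrative only - the userspace primitives
and names are not from the patch, and the release ordering merely stands in for
the patch's wmb()/DMA sync, whose real audience is the SMMU table walker rather
than another CPU.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static _Atomic uintptr_t slot;		/* stand-in for the parent PTE */

/* Try to publish 'mine'; if we lose the race, free it and adopt the winner's */
static uintptr_t *install_or_free(uintptr_t *mine)
{
	uintptr_t expected = 0;

	if (atomic_compare_exchange_strong_explicit(&slot, &expected,
						    (uintptr_t)mine,
						    memory_order_acq_rel,
						    memory_order_acquire))
		return mine;		/* we won: our table is now live */

	free(mine);			/* we lost: the winner's table stands */
	return (uintptr_t *)expected;
}

int main(void)
{
	/* model a 4K table of 512 eight-byte entries */
	uintptr_t *table = install_or_free(calloc(512, sizeof(uintptr_t)));

	printf("next-level table installed at %p\n", (void *)table);
	free(table);
	return 0;
}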

* [PATCH v2 6/8] iommu/io-pgtable-arm-v7s: Support lockless operation
  2017-06-22 15:53 ` Robin Murphy
@ 2017-06-22 15:53     ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

Mirroring the LPAE implementation, rework the v7s code to be robust
against concurrent operations. The same two potential races exist, and
are solved in the same manner, with the fixed 2-level structure making
life ever so slightly simpler.

What complicates matters compared to LPAE, however, is large page
entries, since we can't update a block of 16 PTEs atomically, nor assume
available software bits to do clever things with. As most users are
never likely to do partial unmaps anyway (due to DMA API rules), it
doesn't seem unreasonable for this case to remain behind a serialising
lock; we just pull said lock down into the bowels of the implementation
so it's well out of the way of the normal call paths.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2:
 - Fix barriers in install_table
 - Use READ_ONCE() in iova_to_phys just in case

 drivers/iommu/io-pgtable-arm-v7s.c | 84 ++++++++++++++++++++++++++++----------
 1 file changed, 63 insertions(+), 21 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 1e283be39668..a55fd3858897 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -32,6 +32,7 @@
 
 #define pr_fmt(fmt)	"arm-v7s io-pgtable: " fmt
 
+#include <linux/atomic.h>
 #include <linux/dma-mapping.h>
 #include <linux/gfp.h>
 #include <linux/iommu.h>
@@ -39,6 +40,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/spinlock.h>
 #include <linux/types.h>
 
 #include <asm/barrier.h>
@@ -168,6 +170,7 @@ struct arm_v7s_io_pgtable {
 
 	arm_v7s_iopte		*pgd;
 	struct kmem_cache	*l2_tables;
+	spinlock_t		split_lock;
 };
 
 static dma_addr_t __arm_v7s_dma_addr(void *pages)
@@ -396,16 +399,22 @@ static int arm_v7s_init_pte(struct arm_v7s_io_pgtable *data,
 
 static arm_v7s_iopte arm_v7s_install_table(arm_v7s_iopte *table,
 					   arm_v7s_iopte *ptep,
+					   arm_v7s_iopte curr,
 					   struct io_pgtable_cfg *cfg)
 {
-	arm_v7s_iopte new;
+	arm_v7s_iopte old, new;
 
 	new = virt_to_phys(table) | ARM_V7S_PTE_TYPE_TABLE;
 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
 		new |= ARM_V7S_ATTR_NS_TABLE;
 
-	__arm_v7s_set_pte(ptep, new, 1, cfg);
-	return new;
+	/* Ensure the table itself is visible before its PTE can be */
+	wmb();
+
+	old = cmpxchg_relaxed(ptep, curr, new);
+	__arm_v7s_pte_sync(ptep, 1, cfg);
+
+	return old;
 }
 
 static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
@@ -429,16 +438,23 @@ static int __arm_v7s_map(struct arm_v7s_io_pgtable *data, unsigned long iova,
 		return -EINVAL;
 
 	/* Grab a pointer to the next level */
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (!pte) {
 		cptep = __arm_v7s_alloc_table(lvl + 1, GFP_ATOMIC, data);
 		if (!cptep)
 			return -ENOMEM;
 
-		arm_v7s_install_table(cptep, ptep, cfg);
-	} else if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
-		cptep = iopte_deref(pte, lvl);
+		pte = arm_v7s_install_table(cptep, ptep, 0, cfg);
+		if (pte)
+			__arm_v7s_free_table(cptep, lvl + 1, data);
 	} else {
+		/* We've no easy way of knowing if it's synced yet, so... */
+		__arm_v7s_pte_sync(ptep, 1, cfg);
+	}
+
+	if (ARM_V7S_PTE_IS_TABLE(pte, lvl)) {
+		cptep = iopte_deref(pte, lvl);
+	} else if (pte) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
@@ -491,27 +507,31 @@ static void arm_v7s_free_pgtable(struct io_pgtable *iop)
 	kfree(data);
 }
 
-static void arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
-			       unsigned long iova, int idx, int lvl,
-			       arm_v7s_iopte *ptep)
+static arm_v7s_iopte arm_v7s_split_cont(struct arm_v7s_io_pgtable *data,
+					unsigned long iova, int idx, int lvl,
+					arm_v7s_iopte *ptep)
 {
 	struct io_pgtable *iop = &data->iop;
 	arm_v7s_iopte pte;
 	size_t size = ARM_V7S_BLOCK_SIZE(lvl);
 	int i;
 
+	/* Check that we didn't lose a race to get the lock */
+	pte = *ptep;
+	if (!arm_v7s_pte_is_cont(pte, lvl))
+		return pte;
+
 	ptep -= idx & (ARM_V7S_CONT_PAGES - 1);
-	pte = arm_v7s_cont_to_pte(*ptep, lvl);
-	for (i = 0; i < ARM_V7S_CONT_PAGES; i++) {
-		ptep[i] = pte;
-		pte += size;
-	}
+	pte = arm_v7s_cont_to_pte(pte, lvl);
+	for (i = 0; i < ARM_V7S_CONT_PAGES; i++)
+		ptep[i] = pte + i * size;
 
 	__arm_v7s_pte_sync(ptep, ARM_V7S_CONT_PAGES, &iop->cfg);
 
 	size *= ARM_V7S_CONT_PAGES;
 	io_pgtable_tlb_add_flush(iop, iova, size, size, true);
 	io_pgtable_tlb_sync(iop);
+	return pte;
 }
 
 static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
@@ -542,7 +562,16 @@ static int arm_v7s_split_blk_unmap(struct arm_v7s_io_pgtable *data,
 		__arm_v7s_set_pte(&tablep[i], pte, num_entries, cfg);
 	}
 
-	arm_v7s_install_table(tablep, ptep, cfg);
+	pte = arm_v7s_install_table(tablep, ptep, blk_pte, cfg);
+	if (pte != blk_pte) {
+		__arm_v7s_free_table(tablep, 2, data);
+
+		if (!ARM_V7S_PTE_IS_TABLE(pte, 1))
+			return 0;
+
+		tablep = iopte_deref(pte, 1);
+		return __arm_v7s_unmap(data, iova, size, 2, tablep);
+	}
 
 	io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
 	return size;
@@ -563,17 +592,28 @@ static int __arm_v7s_unmap(struct arm_v7s_io_pgtable *data,
 	idx = ARM_V7S_LVL_IDX(iova, lvl);
 	ptep += idx;
 	do {
-		if (WARN_ON(!ARM_V7S_PTE_IS_VALID(ptep[i])))
+		pte[i] = READ_ONCE(ptep[i]);
+		if (WARN_ON(!ARM_V7S_PTE_IS_VALID(pte[i])))
 			return 0;
-		pte[i] = ptep[i];
 	} while (++i < num_entries);
 
 	/*
 	 * If we've hit a contiguous 'large page' entry at this level, it
 	 * needs splitting first, unless we're unmapping the whole lot.
+	 *
+	 * For splitting, we can't rewrite 16 PTEs atomically, and since we
+	 * can't necessarily assume TEX remap we don't have a software bit to
+	 * mark live entries being split. In practice (i.e. DMA API code), we
+	 * will never be splitting large pages anyway, so just wrap this edge
+	 * case in a lock for the sake of correctness and be done with it.
 	 */
-	if (num_entries <= 1 && arm_v7s_pte_is_cont(pte[0], lvl))
-		arm_v7s_split_cont(data, iova, idx, lvl, ptep);
+	if (num_entries <= 1 && arm_v7s_pte_is_cont(pte[0], lvl)) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&data->split_lock, flags);
+		pte[0] = arm_v7s_split_cont(data, iova, idx, lvl, ptep);
+		spin_unlock_irqrestore(&data->split_lock, flags);
+	}
 
 	/* If the size matches this level, we're in the right place */
 	if (num_entries) {
@@ -631,7 +671,8 @@ static phys_addr_t arm_v7s_iova_to_phys(struct io_pgtable_ops *ops,
 	u32 mask;
 
 	do {
-		pte = ptep[ARM_V7S_LVL_IDX(iova, ++lvl)];
+		ptep += ARM_V7S_LVL_IDX(iova, ++lvl);
+		pte = READ_ONCE(*ptep);
 		ptep = iopte_deref(pte, lvl);
 	} while (ARM_V7S_PTE_IS_TABLE(pte, lvl));
 
@@ -672,6 +713,7 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,
 	if (!data)
 		return NULL;
 
+	spin_lock_init(&data->split_lock);
 	data->l2_tables = kmem_cache_create("io-pgtable_armv7s_l2",
 					    ARM_V7S_TABLE_SIZE(2),
 					    ARM_V7S_TABLE_SIZE(2),
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread
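
The rare contiguous-entry split that stays behind the new split_lock follows
the familiar re-check-under-lock shape: several unmappers may all observe the
contiguous hint, but only the first one through the lock actually rewrites the
16 entries. A rough userspace sketch of that shape follows; the primitives and
the single flag are illustrative stand-ins, not the driver's.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t split_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool entry_is_cont = true;	/* stand-in for the CONT hint */

static void split_cont_entry(void)
{
	if (!atomic_load(&entry_is_cont))	/* common case: nothing to do */
		return;

	pthread_mutex_lock(&split_lock);
	/* Re-check: another unmapper may have done the split while we waited */
	if (atomic_load(&entry_is_cont)) {
		/* ...rewrite the 16 PTEs individually, then sync/flush... */
		atomic_store(&entry_is_cont, false);
	}
	pthread_mutex_unlock(&split_lock);
}

int main(void)
{
	split_cont_entry();	/* first caller performs the split */
	split_cont_entry();	/* later callers see it is already done */
	return 0;
}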

* [PATCH v2 7/8] iommu/arm-smmu: Remove io-pgtable spinlock
  2017-06-22 15:53 ` Robin Murphy
@ 2017-06-22 15:53     ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

With the io-pgtable code now robust against (valid) races, we no longer
need to serialise all operations with a lock. This might make broken
callers who issue concurrent operations on overlapping addresses go even
more wrong than before, but hey, they already had little hope of useful
or deterministic results.

We do however still have to keep a lock around to serialise the ATS1*
translation ops, as parallel iova_to_phys() calls could lead to
unpredictable hardware behaviour otherwise.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/arm-smmu.c | 45 ++++++++++++++-------------------------------
 1 file changed, 14 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 1f42f339a284..b8d069a2b31d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -425,10 +425,10 @@ enum arm_smmu_domain_stage {
 struct arm_smmu_domain {
 	struct arm_smmu_device		*smmu;
 	struct io_pgtable_ops		*pgtbl_ops;
-	spinlock_t			pgtbl_lock;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
 	struct mutex			init_mutex; /* Protects smmu pointer */
+	spinlock_t			cb_lock; /* Serialises ATS1* ops */
 	struct iommu_domain		domain;
 };
 
@@ -1105,7 +1105,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	}
 
 	mutex_init(&smmu_domain->init_mutex);
-	spin_lock_init(&smmu_domain->pgtbl_lock);
+	spin_lock_init(&smmu_domain->cb_lock);
 
 	return &smmu_domain->domain;
 }
@@ -1383,35 +1383,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
-	int ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return -ENODEV;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->map(ops, iova, paddr, size, prot);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->map(ops, iova, paddr, size, prot);
 }
 
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 			     size_t size)
 {
-	size_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->unmap(ops, iova, size);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->unmap(ops, iova, size);
 }
 
 static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
@@ -1425,10 +1413,11 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 	void __iomem *cb_base;
 	u32 tmp;
 	u64 phys;
-	unsigned long va;
+	unsigned long va, flags;
 
 	cb_base = ARM_SMMU_CB(smmu, cfg->cbndx);
 
+	spin_lock_irqsave(&smmu_domain->cb_lock, flags);
 	/* ATS1 registers can only be written atomically */
 	va = iova & ~0xfffUL;
 	if (smmu->version == ARM_SMMU_V2)
@@ -1438,6 +1427,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 
 	if (readl_poll_timeout_atomic(cb_base + ARM_SMMU_CB_ATSR, tmp,
 				      !(tmp & ATSR_ACTIVE), 5, 50)) {
+		spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
 		dev_err(dev,
 			"iova to phys timed out on %pad. Falling back to software table walk.\n",
 			&iova);
@@ -1445,6 +1435,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 	}
 
 	phys = readq_relaxed(cb_base + ARM_SMMU_CB_PAR);
+	spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
 	if (phys & CB_PAR_F) {
 		dev_err(dev, "translation fault!\n");
 		dev_err(dev, "PAR = 0x%llx\n", phys);
@@ -1457,10 +1448,8 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 					dma_addr_t iova)
 {
-	phys_addr_t ret;
-	unsigned long flags;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
 
 	if (domain->type == IOMMU_DOMAIN_IDENTITY)
 		return iova;
@@ -1468,17 +1457,11 @@ static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
 	if (smmu_domain->smmu->features & ARM_SMMU_FEAT_TRANS_OPS &&
-			smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
-		ret = arm_smmu_iova_to_phys_hard(domain, iova);
-	} else {
-		ret = ops->iova_to_phys(ops, iova);
-	}
+			smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
+		return arm_smmu_iova_to_phys_hard(domain, iova);
 
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-
-	return ret;
+	return ops->iova_to_phys(ops, iova);
 }
 
 static bool arm_smmu_capable(enum iommu_cap cap)
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread
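
The ATS1* path that keeps the (much narrower) cb_lock above is a
write-registers-then-poll sequence: the VA goes into one register and the PA
comes back out of another, so two CPUs interleaving those steps could collect
each other's results. Here is a toy model of why such a sequence has to remain
serialised; the register names and the fake translation are made up for
illustration.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Pretend device interface: one shared address register, one shared result
 * register. A translation is only meaningful if nobody else touches the
 * registers between the write and the read. */
static uint64_t atsr_addr;
static uint64_t atsr_result;
static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t fake_walk(uint64_t va)
{
	return va | 0x1000;			/* pretend translation */
}

static uint64_t translate(uint64_t va)
{
	uint64_t pa;

	pthread_mutex_lock(&cb_lock);
	atsr_addr = va;				/* kick off the translation */
	atsr_result = fake_walk(atsr_addr);	/* "device" completes it */
	pa = atsr_result;			/* collect the result */
	pthread_mutex_unlock(&cb_lock);

	return pa;
}

int main(void)
{
	unsigned long long va = 0x2000;

	printf("0x%llx -> 0x%llx\n", va, (unsigned long long)translate(va));
	return 0;
}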

* [PATCH v2 8/8] iommu/arm-smmu-v3: Remove io-pgtable spinlock
  2017-06-22 15:53 ` Robin Murphy
@ 2017-06-22 15:53     ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-22 15:53 UTC (permalink / raw)
  To: will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

As for SMMUv2, take advantage of io-pgtable's newfound tolerance for
concurrency. Unfortunately in this case the command queue lock remains a
point of serialisation for the unmap path, but there may be a little
more we can do to ameliorate that in future.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/arm-smmu-v3.c | 33 ++++++---------------------------
 1 file changed, 6 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b9c4cf4ccca2..291da5f918d5 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -645,7 +645,6 @@ struct arm_smmu_domain {
 	struct mutex			init_mutex; /* Protects smmu pointer */
 
 	struct io_pgtable_ops		*pgtbl_ops;
-	spinlock_t			pgtbl_lock;
 
 	enum arm_smmu_domain_stage	stage;
 	union {
@@ -1406,7 +1405,6 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	}
 
 	mutex_init(&smmu_domain->init_mutex);
-	spin_lock_init(&smmu_domain->pgtbl_lock);
 	return &smmu_domain->domain;
 }
 
@@ -1678,44 +1676,29 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 			phys_addr_t paddr, size_t size, int prot)
 {
-	int ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return -ENODEV;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->map(ops, iova, paddr, size, prot);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->map(ops, iova, paddr, size, prot);
 }
 
 static size_t
 arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 {
-	size_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->unmap(ops, iova, size);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-	return ret;
+	return ops->unmap(ops, iova, size);
 }
 
 static phys_addr_t
 arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
-	phys_addr_t ret;
-	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
 
 	if (domain->type == IOMMU_DOMAIN_IDENTITY)
 		return iova;
@@ -1723,11 +1706,7 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 	if (!ops)
 		return 0;
 
-	spin_lock_irqsave(&smmu_domain->pgtbl_lock, flags);
-	ret = ops->iova_to_phys(ops, iova);
-	spin_unlock_irqrestore(&smmu_domain->pgtbl_lock, flags);
-
-	return ret;
+	return ops->iova_to_phys(ops, iova);
 }
 
 static struct platform_driver arm_smmu_driver;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-22 15:53     ` Robin Murphy
@ 2017-06-23  5:53         ` Linu Cherian
  -1 siblings, 0 replies; 44+ messages in thread
From: Linu Cherian @ 2017-06-23  5:53 UTC (permalink / raw)
  To: Robin Murphy
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA


Robin,
Was trying to understand the new changes. Had a few questions on
arm_lpae_install_table.

On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> For parallel I/O with multiple concurrent threads servicing the same
> device (or devices, if several share a domain), serialising page table
> updates becomes a massive bottleneck. On reflection, though, we don't
> strictly need to do that - for valid IOMMU API usage, there are in fact
> only two races that we need to guard against: multiple map requests for
> different blocks within the same region, when the intermediate-level
> table for that region does not yet exist; and multiple unmaps of
> different parts of the same block entry. Both of those are fairly easily
> solved by using a cmpxchg to install the new table, such that if we then
> find that someone else's table got there first, we can simply free ours
> and continue.
> 
> Make the requisite changes such that we can withstand being called
> without the caller maintaining a lock. In theory, this opens up a few
> corners in which wildly misbehaving callers making nonsensical
> overlapping requests might lead to crashes instead of just unpredictable
> results, but correct code really does not deserve to pay a significant
> performance cost for the sake of masking bugs in theoretical broken code.
> 
> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> ---
> 
> v2:
>  - Fix barriers in install_table
>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
>  - Minor cosmetics
> 
>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 57 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 6334f51912ea..52700fa958c2 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -20,6 +20,7 @@
>  
>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
>  
> +#include <linux/atomic.h>
>  #include <linux/iommu.h>
>  #include <linux/kernel.h>
>  #include <linux/sizes.h>
> @@ -99,6 +100,8 @@
>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> +/* Software bit for solving coherency races */
> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
>  
>  /* Stage-1 PTE */
>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
>  	free_pages_exact(pages, size);
>  }
>  
> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> +				struct io_pgtable_cfg *cfg)
> +{
> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> +				   sizeof(*ptep), DMA_TO_DEVICE);
> +}
> +
>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
>  			       struct io_pgtable_cfg *cfg)
>  {
>  	*ptep = pte;
>  
>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> -		dma_sync_single_for_device(cfg->iommu_dev,
> -					   __arm_lpae_dma_addr(ptep),
> -					   sizeof(pte), DMA_TO_DEVICE);
> +		__arm_lpae_sync_pte(ptep, cfg);
>  }
>  
>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  
>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>  					     arm_lpae_iopte *ptep,
> +					     arm_lpae_iopte curr,
>  					     struct io_pgtable_cfg *cfg)
>  {
> -	arm_lpae_iopte new;
> +	arm_lpae_iopte old, new;
>  
>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
>  		new |= ARM_LPAE_PTE_NSTABLE;
>  
> -	__arm_lpae_set_pte(ptep, new, cfg);
> -	return new;
> +	/* Ensure the table itself is visible before its PTE can be */
> +	wmb();

Could you please give more hints on why this is required?


> +
> +	old = cmpxchg64_relaxed(ptep, curr, new);
> +
> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> +		return old;
> +
> +	/* Even if it's not ours, there's no point waiting; just kick it */
> +	__arm_lpae_sync_pte(ptep, cfg);
> +	if (old == curr)
> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

How is it ensured that the cache operations are completed before we flag them
with ARM_LPAE_PTE_SW_SYNC? The previous version had a wmb() after the sync
operation.


> +
> +	return old;
>  }
>


Thanks.  

-- 
Linu cherian

^ permalink raw reply	[flat|nested] 44+ messages in thread
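
The first question above concerns the classic publish pattern: a freshly
written table has to be observable before the PTE that points at it is. A
minimal illustration of that pattern in C11 acquire/release terms follows; it
is a sketch of the general idea, not the driver code - the patch itself uses
wmb() plus dma_sync_single_for_device() because the eventual observer can be
the SMMU's table walker rather than another CPU.

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct table {
	unsigned long pte[512];
};

static _Atomic(struct table *) parent_slot;

/* Writer: fill the table first, then publish the pointer with release order */
static void publish(struct table *t)
{
	for (size_t i = 0; i < 512; i++)
		t->pte[i] = 0;
	atomic_store_explicit(&parent_slot, t, memory_order_release);
}

/* Reader: acquire pairs with the writer's release, so the contents are seen */
static struct table *lookup(void)
{
	return atomic_load_explicit(&parent_slot, memory_order_acquire);
}

int main(void)
{
	static struct table t;

	publish(&t);
	printf("table published at %p\n", (void *)lookup());
	return 0;
}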

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-22 15:53 ` Robin Murphy
@ 2017-06-23  8:47     ` John Garry
  -1 siblings, 0 replies; 44+ messages in thread
From: John Garry @ 2017-06-23  8:47 UTC (permalink / raw)
  To: Robin Murphy, will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q, Linuxarm,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

On 22/06/2017 16:53, Robin Murphy wrote:
> The feedback has been promising, so v2 is just a final update to cover
> a handful of memory ordering and cosmetic tweaks that came up when Will
> and I went through this offline.
>
> Thanks,
> Robin.

Hi Robin,

Is it worth us retesting this patchset?

If yes, as for a branch (for convenience), does this one now match v2:
git://linux-arm.org/linux-rm  iommu/pgtable

I saw it was rewritten recently.

Thanks,
John

>
>
> Robin Murphy (8):
>   iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>   iommu/io-pgtable-arm: Improve split_blk_unmap
>   iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>   iommu/io-pgtable: Introduce explicit coherency
>   iommu/io-pgtable-arm: Support lockless operation
>   iommu/io-pgtable-arm-v7s: Support lockless operation
>   iommu/arm-smmu: Remove io-pgtable spinlock
>   iommu/arm-smmu-v3: Remove io-pgtable spinlock
>
>  drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>  drivers/iommu/arm-smmu.c           |  48 ++++------
>  drivers/iommu/io-pgtable-arm-v7s.c | 179 +++++++++++++++++++++-------------
>  drivers/iommu/io-pgtable-arm.c     | 191 ++++++++++++++++++++++++-------------
>  drivers/iommu/io-pgtable.h         |   6 ++
>  5 files changed, 274 insertions(+), 186 deletions(-)
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-23  5:53         ` Linu Cherian
@ 2017-06-23  8:56           ` Linu Cherian
  -1 siblings, 0 replies; 44+ messages in thread
From: Linu Cherian @ 2017-06-23  8:56 UTC (permalink / raw)
  To: Robin Murphy
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA

On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> 
> Robin,
> Was trying to understand the new changes. Had few questions on 
> arm_lpae_install_table. 
> 
> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> > For parallel I/O with multiple concurrent threads servicing the same
> > device (or devices, if several share a domain), serialising page table
> > updates becomes a massive bottleneck. On reflection, though, we don't
> > strictly need to do that - for valid IOMMU API usage, there are in fact
> > only two races that we need to guard against: multiple map requests for
> > different blocks within the same region, when the intermediate-level
> > table for that region does not yet exist; and multiple unmaps of
> > different parts of the same block entry. Both of those are fairly easily
> > solved by using a cmpxchg to install the new table, such that if we then
> > find that someone else's table got there first, we can simply free ours
> > and continue.
> > 
> > Make the requisite changes such that we can withstand being called
> > without the caller maintaining a lock. In theory, this opens up a few
> > corners in which wildly misbehaving callers making nonsensical
> > overlapping requests might lead to crashes instead of just unpredictable
> > results, but correct code really does not deserve to pay a significant
> > performance cost for the sake of masking bugs in theoretical broken code.
> > 
> > Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> > ---
> > 
> > v2:
> >  - Fix barriers in install_table
> >  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> >  - Minor cosmetics
> > 
> >  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> >  1 file changed, 57 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index 6334f51912ea..52700fa958c2 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -20,6 +20,7 @@
> >  
> >  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> >  
> > +#include <linux/atomic.h>
> >  #include <linux/iommu.h>
> >  #include <linux/kernel.h>
> >  #include <linux/sizes.h>
> > @@ -99,6 +100,8 @@
> >  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> >  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> >  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> > +/* Software bit for solving coherency races */
> > +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> >  
> >  /* Stage-1 PTE */
> >  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> > @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> >  	free_pages_exact(pages, size);
> >  }
> >  
> > +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> > +				struct io_pgtable_cfg *cfg)
> > +{
> > +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> > +				   sizeof(*ptep), DMA_TO_DEVICE);
> > +}
> > +
> >  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> >  			       struct io_pgtable_cfg *cfg)
> >  {
> >  	*ptep = pte;
> >  
> >  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> > -		dma_sync_single_for_device(cfg->iommu_dev,
> > -					   __arm_lpae_dma_addr(ptep),
> > -					   sizeof(pte), DMA_TO_DEVICE);
> > +		__arm_lpae_sync_pte(ptep, cfg);
> >  }
> >  
> >  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >  
> >  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >  					     arm_lpae_iopte *ptep,
> > +					     arm_lpae_iopte curr,
> >  					     struct io_pgtable_cfg *cfg)
> >  {
> > -	arm_lpae_iopte new;
> > +	arm_lpae_iopte old, new;
> >  
> >  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >  		new |= ARM_LPAE_PTE_NSTABLE;
> >  
> > -	__arm_lpae_set_pte(ptep, new, cfg);
> > -	return new;
> > +	/* Ensure the table itself is visible before its PTE can be */
> > +	wmb();
> 
> Could you please give more hints on why this is required.
> 
> 
> > +
> > +	old = cmpxchg64_relaxed(ptep, curr, new);
> > +
> > +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> > +	    (old & ARM_LPAE_PTE_SW_SYNC))
> > +		return old;
> > +
> > +	/* Even if it's not ours, there's no point waiting; just kick it */
> > +	__arm_lpae_sync_pte(ptep, cfg);
> > +	if (old == curr)
> > +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> 
> How is it ensured that the cache operations are completed before we flag them with
> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> 
> 
> > +
> > +	return old;
> >  }
> >
> 


Observed a performance drop of close to 1Gb/s while testing on the 10G
network interface with this patch series, compared to v1.
Moving the wmb() back to where it was in v1 restores the performance.

@@ -332,7 +332,6 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
                new |= ARM_LPAE_PTE_NSTABLE;
 
        /* Ensure the table itself is visible before its PTE can be */
-       wmb();
 
        old = cmpxchg64_relaxed(ptep, curr, new);
 
@@ -342,6 +341,7 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 
        /* Even if it's not ours, there's no point waiting; just kick it */
        __arm_lpae_sync_pte(ptep, cfg);
+       wmb();
        if (old == curr)
                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
 



-- 
Linu cherian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-23  8:47     ` John Garry
@ 2017-06-23  9:58         ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-23  9:58 UTC (permalink / raw)
  To: John Garry, will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q, Linuxarm,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA

On 23/06/17 09:47, John Garry wrote:
> On 22/06/2017 16:53, Robin Murphy wrote:
>> The feedback has been promising, so v2 is just a final update to cover
>> a handful of memory ordering and cosmetic tweaks that came up when Will
>> and I went through this offline.
>>
>> Thanks,
>> Robin.
> 
> Hi Robin,
> 
> Is it worth us retesting this patchset?
> 
> If yes, as for a branch (for convenience), does this one now match v2:
> git://linux-arm.org/linux-rm  iommu/pgtable
> 
> I saw it was rewritten recently.

Indeed, I pushed the branch out before posting. Functionally it ought to
be identical to v1, so this was more just a heads-up - feel free to
double-check I've not broken anything, but I wouldn't say you need to
throw the full test lab at it again.

Thanks,
Robin.

> 
> Thanks,
> John
> 
>>
>>
>> Robin Murphy (8):
>>   iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>   iommu/io-pgtable-arm: Improve split_blk_unmap
>>   iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>   iommu/io-pgtable: Introduce explicit coherency
>>   iommu/io-pgtable-arm: Support lockless operation
>>   iommu/io-pgtable-arm-v7s: Support lockless operation
>>   iommu/arm-smmu: Remove io-pgtable spinlock
>>   iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>
>>  drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>  drivers/iommu/arm-smmu.c           |  48 ++++------
>>  drivers/iommu/io-pgtable-arm-v7s.c | 179
>> +++++++++++++++++++++-------------
>>  drivers/iommu/io-pgtable-arm.c     | 191
>> ++++++++++++++++++++++++-------------
>>  drivers/iommu/io-pgtable.h         |   6 ++
>>  5 files changed, 274 insertions(+), 186 deletions(-)
>>
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-23  8:56           ` Linu Cherian
@ 2017-06-23 10:35             ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-23 10:35 UTC (permalink / raw)
  To: Linu Cherian
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA

On 23/06/17 09:56, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
>>
>> Robin,
>> Was trying to understand the new changes. Had few questions on 
>> arm_lpae_install_table. 
>>
>> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
>>> For parallel I/O with multiple concurrent threads servicing the same
>>> device (or devices, if several share a domain), serialising page table
>>> updates becomes a massive bottleneck. On reflection, though, we don't
>>> strictly need to do that - for valid IOMMU API usage, there are in fact
>>> only two races that we need to guard against: multiple map requests for
>>> different blocks within the same region, when the intermediate-level
>>> table for that region does not yet exist; and multiple unmaps of
>>> different parts of the same block entry. Both of those are fairly easily
>>> solved by using a cmpxchg to install the new table, such that if we then
>>> find that someone else's table got there first, we can simply free ours
>>> and continue.
>>>
>>> Make the requisite changes such that we can withstand being called
>>> without the caller maintaining a lock. In theory, this opens up a few
>>> corners in which wildly misbehaving callers making nonsensical
>>> overlapping requests might lead to crashes instead of just unpredictable
>>> results, but correct code really does not deserve to pay a significant
>>> performance cost for the sake of masking bugs in theoretical broken code.
>>>
>>> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
>>> ---
>>>
>>> v2:
>>>  - Fix barriers in install_table
>>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
>>>  - Minor cosmetics
>>>
>>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 57 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>>> index 6334f51912ea..52700fa958c2 100644
>>> --- a/drivers/iommu/io-pgtable-arm.c
>>> +++ b/drivers/iommu/io-pgtable-arm.c
>>> @@ -20,6 +20,7 @@
>>>  
>>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
>>>  
>>> +#include <linux/atomic.h>
>>>  #include <linux/iommu.h>
>>>  #include <linux/kernel.h>
>>>  #include <linux/sizes.h>
>>> @@ -99,6 +100,8 @@
>>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
>>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
>>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
>>> +/* Software bit for solving coherency races */
>>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
>>>  
>>>  /* Stage-1 PTE */
>>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
>>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
>>>  	free_pages_exact(pages, size);
>>>  }
>>>  
>>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
>>> +				struct io_pgtable_cfg *cfg)
>>> +{
>>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
>>> +				   sizeof(*ptep), DMA_TO_DEVICE);
>>> +}
>>> +
>>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
>>>  			       struct io_pgtable_cfg *cfg)
>>>  {
>>>  	*ptep = pte;
>>>  
>>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
>>> -		dma_sync_single_for_device(cfg->iommu_dev,
>>> -					   __arm_lpae_dma_addr(ptep),
>>> -					   sizeof(pte), DMA_TO_DEVICE);
>>> +		__arm_lpae_sync_pte(ptep, cfg);
>>>  }
>>>  
>>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>>>  
>>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>>>  					     arm_lpae_iopte *ptep,
>>> +					     arm_lpae_iopte curr,
>>>  					     struct io_pgtable_cfg *cfg)
>>>  {
>>> -	arm_lpae_iopte new;
>>> +	arm_lpae_iopte old, new;
>>>  
>>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
>>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
>>>  		new |= ARM_LPAE_PTE_NSTABLE;
>>>  
>>> -	__arm_lpae_set_pte(ptep, new, cfg);
>>> -	return new;
>>> +	/* Ensure the table itself is visible before its PTE can be */
>>> +	wmb();
>>
>> Could you please give more hints on why this is required.

In theory it's possible for the write to ptep to become visible before
the previous writes filling out the PTEs in table - if the IOMMU
happened to speculatively walk ptep while parts of table were still sat
in a write buffer somewhere, it could see the old contents of that page
and potentially allocate bogus TLB entries if the stale data happened to
look like valid PTEs. In the non-coherent case the DMA cache maintenance
is sufficient, but otherwise we need a barrier to order writing the
next-level PTEs strictly before writing the table PTE pointing to them,
such that the IOMMU cannot at any point see a partially-initialised table.
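
To put that in code form (a minimal sketch of the coherent-walker case using
the names from the patch above, not an excerpt from it):

	/* CPU populating a brand-new next-level table */
	table[0] = pte0;	/* these stores may still sit in a write buffer... */
	table[1] = pte1;
	wmb();			/* ...so order them before publishing the pointer */
	cmpxchg64_relaxed(ptep, curr, __pa(table) | ARM_LPAE_PTE_TYPE_TABLE);
	/* a walk that observes the new *ptep now also observes pte0/pte1,
	   never the stale prior contents of the page */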

>>> +
>>> +	old = cmpxchg64_relaxed(ptep, curr, new);
>>> +
>>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
>>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
>>> +		return old;
>>> +
>>> +	/* Even if it's not ours, there's no point waiting; just kick it */
>>> +	__arm_lpae_sync_pte(ptep, cfg);
>>> +	if (old == curr)
>>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
>>
>> How is it ensured that the cache operations are completed before we flag them with
>> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.

The wmb() here was a silly oversight on my part - as Will reminded me,
dma_sync_*() already has its own barrier to ensure completion, which is
pretty obvious in retrospect because the entire streaming DMA would be
totally broken otherwise.

>>> +
>>> +	return old;
>>>  }
>>>
>>
> 
> 
> Observed a performance drop of close to 1G, 
> while testing on the 10G network interface with this patch series compared to v1.

Is that consistent over multiple runs? I wouldn't expect many workloads
to be thrashing table and hugepage mappings in the same IOVA region, so
after a point we should tend to reach a fairly steady state where we're
only changing leaf PTEs.

> Moving the wmb() as in v1 restores it back.

Note that on a coherent platform like ThunderX that's as good as just
deleting it, because you'll never execute the case below. However, on
reflection I think it can at least safely be downgraded to dma_wmb()
(i.e. DMB) rather than a full DSB - would you be able to test what
difference that makes?

Robin.

> @@ -332,7 +332,6 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>                 new |= ARM_LPAE_PTE_NSTABLE;
>  
>         /* Ensure the table itself is visible before its PTE can be */
> -       wmb();
> 
>         old = cmpxchg64_relaxed(ptep, curr, new);
>  
> @@ -342,6 +341,7 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>  
>         /* Even if it's not ours, there's no point waiting; just kick it */
>         __arm_lpae_sync_pte(ptep, cfg);
> +       wmb();
>         if (old == curr)
>                 WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
>  
> 
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-23 10:35             ` Robin Murphy
@ 2017-06-23 11:34                 ` Linu Cherian
  -1 siblings, 0 replies; 44+ messages in thread
From: Linu Cherian @ 2017-06-23 11:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA

On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> On 23/06/17 09:56, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> >>
> >> Robin,
> >> Was trying to understand the new changes. Had few questions on 
> >> arm_lpae_install_table. 
> >>
> >> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> >>> For parallel I/O with multiple concurrent threads servicing the same
> >>> device (or devices, if several share a domain), serialising page table
> >>> updates becomes a massive bottleneck. On reflection, though, we don't
> >>> strictly need to do that - for valid IOMMU API usage, there are in fact
> >>> only two races that we need to guard against: multiple map requests for
> >>> different blocks within the same region, when the intermediate-level
> >>> table for that region does not yet exist; and multiple unmaps of
> >>> different parts of the same block entry. Both of those are fairly easily
> >>> solved by using a cmpxchg to install the new table, such that if we then
> >>> find that someone else's table got there first, we can simply free ours
> >>> and continue.
> >>>
> >>> Make the requisite changes such that we can withstand being called
> >>> without the caller maintaining a lock. In theory, this opens up a few
> >>> corners in which wildly misbehaving callers making nonsensical
> >>> overlapping requests might lead to crashes instead of just unpredictable
> >>> results, but correct code really does not deserve to pay a significant
> >>> performance cost for the sake of masking bugs in theoretical broken code.
> >>>
> >>> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> >>> ---
> >>>
> >>> v2:
> >>>  - Fix barriers in install_table
> >>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> >>>  - Minor cosmetics
> >>>
> >>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> >>>  1 file changed, 57 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> >>> index 6334f51912ea..52700fa958c2 100644
> >>> --- a/drivers/iommu/io-pgtable-arm.c
> >>> +++ b/drivers/iommu/io-pgtable-arm.c
> >>> @@ -20,6 +20,7 @@
> >>>  
> >>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> >>>  
> >>> +#include <linux/atomic.h>
> >>>  #include <linux/iommu.h>
> >>>  #include <linux/kernel.h>
> >>>  #include <linux/sizes.h>
> >>> @@ -99,6 +100,8 @@
> >>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> >>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> >>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> >>> +/* Software bit for solving coherency races */
> >>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> >>>  
> >>>  /* Stage-1 PTE */
> >>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> >>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> >>>  	free_pages_exact(pages, size);
> >>>  }
> >>>  
> >>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> >>> +				struct io_pgtable_cfg *cfg)
> >>> +{
> >>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> >>> +				   sizeof(*ptep), DMA_TO_DEVICE);
> >>> +}
> >>> +
> >>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> >>>  			       struct io_pgtable_cfg *cfg)
> >>>  {
> >>>  	*ptep = pte;
> >>>  
> >>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> >>> -		dma_sync_single_for_device(cfg->iommu_dev,
> >>> -					   __arm_lpae_dma_addr(ptep),
> >>> -					   sizeof(pte), DMA_TO_DEVICE);
> >>> +		__arm_lpae_sync_pte(ptep, cfg);
> >>>  }
> >>>  
> >>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> >>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >>>  
> >>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >>>  					     arm_lpae_iopte *ptep,
> >>> +					     arm_lpae_iopte curr,
> >>>  					     struct io_pgtable_cfg *cfg)
> >>>  {
> >>> -	arm_lpae_iopte new;
> >>> +	arm_lpae_iopte old, new;
> >>>  
> >>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >>>  		new |= ARM_LPAE_PTE_NSTABLE;
> >>>  
> >>> -	__arm_lpae_set_pte(ptep, new, cfg);
> >>> -	return new;
> >>> +	/* Ensure the table itself is visible before its PTE can be */
> >>> +	wmb();
> >>
> >> Could you please give more hints on why this is required.
> 
> In theory it's possible for the write to ptep to become visible before
> the previous writes filling out the PTEs in table - if the IOMMU
> happened to speculatively walk ptep while parts of table were still sat
> in a write buffer somewhere, it could see the old contents of that page
> and potentially allocate bogus TLB entries if the stale data happened to
> look like valid PTEs. In the non-coherent case the DMA cache maintenance
> is sufficient, but otherwise we need a barrier to order writing the
> next-level PTEs strictly before writing the table PTE pointing to them,
> such that the IOMMU cannot at any point see a partially-initialised table.

Got it. Thanks a lot for explaining that.

> 
> >>> +
> >>> +	old = cmpxchg64_relaxed(ptep, curr, new);
> >>> +
> >>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> >>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> >>> +		return old;
> >>> +
> >>> +	/* Even if it's not ours, there's no point waiting; just kick it */
> >>> +	__arm_lpae_sync_pte(ptep, cfg);
> >>> +	if (old == curr)
> >>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> >>
> >> How is it ensured that the cache operations are completed before we flag them with
> >> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> 
> The wmb() here was a silly oversight on my part - as Will reminded me,
> dma_sync_*() already has its own barrier to ensure completion, which is
> pretty obvious in retrospect because the entire streaming DMA would be
> totally broken otherwise.
> 
> >>> +
> >>> +	return old;
> >>>  }
> >>>
> >>
> > 
> > 
> > Observed a performance drop of close to 1G, 
> > while testing on the 10G network interface with this patch series compared to v1.
> 
> Is that consistent over multiple runs? I wouldn't expect many workloads
> to be thrashing table and hugepage mappings in the same IOVA region, so
> after a point we should tend to reach a fairly steady state where we're
> only changing leaf PTEs.
> 
> > Moving the wmb() as in v1 restores it back.
> 
> Note that on a coherent platform like ThunderX that's as good as just
> deleting it, because you'll never execute the case below. However, on
> reflection I think it can at least safely be downgraded to dma_wmb()
> (i.e. DMB) rather than a full DSB - would you be able to test what
> difference that makes?

The testing was done on ThunderX 1, which has a non-coherent page table walk.
Yes, downgrading to dma_wmb() helps. With this change, performance is back to
the v1 level.
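
For reference, the variant tested amounts to no more than this (a sketch of
the change against the v2 hunk quoted above, not a posted patch):

	/* Ensure the table itself is visible before its PTE can be */
-	wmb();
+	dma_wmb();

	old = cmpxchg64_relaxed(ptep, curr, new);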



Thanks.

-- 
Linu cherian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-23  9:58         ` Robin Murphy
@ 2017-06-26 11:35           ` John Garry
  -1 siblings, 0 replies; 44+ messages in thread
From: John Garry @ 2017-06-26 11:35 UTC (permalink / raw)
  To: Robin Murphy, will.deacon, joro
  Cc: salil.mehta, sunil.goutham, thunder.leizhen, Hanjun Guo,
	wangzhou1, Shameerali Kolothum Thodi, Linuxarm, iommu, ray.jui,
	linux-arm-kernel, linu.cherian, nwatters

On 23/06/2017 10:58, Robin Murphy wrote:
> On 23/06/17 09:47, John Garry wrote:
>> On 22/06/2017 16:53, Robin Murphy wrote:
>>> The feedback has been promising, so v2 is just a final update to cover
>>> a handful of memory ordering and cosmetic tweaks that came up when Will
>>> and I went through this offline.
>>>
>>> Thanks,
>>> Robin.
>>
>> Hi Robin,
>>
>> Is it worth us retesting this patchset?
>>
>> If yes, as for a branch (for convenience), does this one now match v2:
>> git://linux-arm.org/linux-rm  iommu/pgtable
>>
>> I saw it was rewritten recently.
>
> Indeed, I pushed the branch out before posting. Functionally it ought to
> be identical to v1, so this was more just a heads-up - feel free to
> double-check I've not broken anything, but I wouldn't say you need to
> throw the full test lab at it again.
>

I saw Will has already sent the pull request. But, FWIW, we are seeing
roughly the same performance as with the v1 patchset. For the PCI NIC, Zhou
again found the performance drop goes from ~15% down to ~8% with the SMMU
enabled, and for the integrated storage controller [platform device], we
still see a drop of about 50%, depending on data rates (Leizhen has been
working on fixing this).

John

> Thanks,
> Robin.
>
>>
>> Thanks,
>> John
>>
>>>
>>>
>>> Robin Murphy (8):
>>>   iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>>   iommu/io-pgtable-arm: Improve split_blk_unmap
>>>   iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>>   iommu/io-pgtable: Introduce explicit coherency
>>>   iommu/io-pgtable-arm: Support lockless operation
>>>   iommu/io-pgtable-arm-v7s: Support lockless operation
>>>   iommu/arm-smmu: Remove io-pgtable spinlock
>>>   iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>>
>>>  drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>>  drivers/iommu/arm-smmu.c           |  48 ++++------
>>>  drivers/iommu/io-pgtable-arm-v7s.c | 179
>>> +++++++++++++++++++++-------------
>>>  drivers/iommu/io-pgtable-arm.c     | 191
>>> ++++++++++++++++++++++++-------------
>>>  drivers/iommu/io-pgtable.h         |   6 ++
>>>  5 files changed, 274 insertions(+), 186 deletions(-)
>>>
>>
>>
>
>
> .
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-26 11:35           ` John Garry
@ 2017-06-26 12:31               ` Robin Murphy
  -1 siblings, 0 replies; 44+ messages in thread
From: Robin Murphy @ 2017-06-26 12:31 UTC (permalink / raw)
  To: John Garry, will.deacon-5wv7dgnIgG8, joro-zLv9SwRftAIdnm+yROfE0A
  Cc: linu.cherian-YGCgFSpz5w/QT0dZR+AlfA,
	sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA, Linuxarm,
	Shameerali Kolothum Thodi, wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w, Hanjun Guo,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 26/06/17 12:35, John Garry wrote:
> On 23/06/2017 10:58, Robin Murphy wrote:
>> On 23/06/17 09:47, John Garry wrote:
>>> On 22/06/2017 16:53, Robin Murphy wrote:
>>>> The feedback has been promising, so v2 is just a final update to cover
>>>> a handful of memory ordering and cosmetic tweaks that came up when Will
>>>> and I went through this offline.
>>>>
>>>> Thanks,
>>>> Robin.
>>>
>>> Hi Robin,
>>>
>>> Is it worth us retesting this patchset?
>>>
>>> If yes, as for a branch (for convenience), does this one now match v2:
>>> git://linux-arm.org/linux-rm  iommu/pgtable
>>>
>>> I saw it was rewritten recently.
>>
>> Indeed, I pushed the branch out before posting. Functionally it ought to
>> be identical to v1, so this was more just a heads-up - feel free to
>> double-check I've not broken anything, but I wouldn't say you need to
>> throw the full test lab at it again.
>>
> 
> I saw Will has already sent the pull request. But, FWIW, we are seeing
> roughly the same performance as v1 patchset. For PCI NIC, Zhou again
> found performance drop goes from ~15->8% with SMMU enabled, and for
> integrated storage controller [platform device], we still see a drop of
> about 50%, depending on datarates (Leizhen has been working on fixing
> this).

Thanks for confirming. Following Joerg's suggestion that the storage
workloads may still depend on rbtree performance - it had slipped my
mind that even with small block sizes those could well be grouped into
scatterlists large enough to trigger a >64-page IOVA allocation - I've
taken the liberty of cooking up a simplified version of Leizhen's rbtree
optimisation series in the iommu/iova branch of my tree. I'll follow up
on that after the merge window, but if anyone wants to play with it in
the meantime feel free.
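
As a rough illustration of that point (generic DMA API usage assuming the
iommu-dma path behind it, not code from this series): a storage driver maps
each request's scatterlist in a single call,

	int nents = dma_map_sg(dev, sgl, sg_count, DMA_FROM_DEVICE);

and iommu-dma lays the whole list out in one contiguous IOVA range, so e.g.
128 x 4KB segments already mean a 128-page allocation - beyond the cached
sizes and straight into the rbtree search.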

Robin.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-26 12:31               ` Robin Murphy
@ 2017-06-26 13:12                 ` John Garry
  -1 siblings, 0 replies; 44+ messages in thread
From: John Garry @ 2017-06-26 13:12 UTC (permalink / raw)
  To: Robin Murphy, will.deacon, joro
  Cc: salil.mehta, linu.cherian, sunil.goutham, thunder.leizhen,
	Linuxarm, Shameerali Kolothum Thodi, wangzhou1, iommu, ray.jui,
	nwatters, Hanjun Guo, linux-arm-kernel


>>
>> I saw Will has already sent the pull request. But, FWIW, we are seeing
>> roughly the same performance as v1 patchset. For PCI NIC, Zhou again
>> found performance drop goes from ~15->8% with SMMU enabled, and for
>> integrated storage controller [platform device], we still see a drop of
>> about 50%, depending on datarates (Leizhen has been working on fixing
>> this).
>
> Thanks for confirming. Following Joerg's suggestion that the storage
> workloads may still depend on rbtree performance - it had slipped my
> mind that even with small block sizes those could well be grouped into
> scatterlists large enough to trigger a >64-page IOVA allocation - I've
> taken the liberty of cooking up a simplified version of Leizhen's rbtree
> optimisation series in the iommu/iova branch of my tree. I'll follow up
> on that after the merge window, but if anyone wants to play with it in
> the meantime feel free.

Just a reminder that we also saw poor performance with our integrated
NIC on your v1 patchset (I can push for v2 patchset testing, but expect
the same).

We might now be able to include an LSI 3108 PCI SAS card in our testing
as well, to give a broader set of results.

John

>
> Robin.
>
> .
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/8] io-pgtable lock removal
  2017-06-26 13:12                 ` John Garry
@ 2017-06-26 13:19                     ` Leizhen (ThunderTown)
  -1 siblings, 0 replies; 44+ messages in thread
From: Leizhen (ThunderTown) @ 2017-06-26 13:19 UTC (permalink / raw)
  To: John Garry, Robin Murphy, will.deacon-5wv7dgnIgG8,
	joro-zLv9SwRftAIdnm+yROfE0A
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	linu.cherian-YGCgFSpz5w/QT0dZR+AlfA, Linuxarm,
	Shameerali Kolothum Thodi, wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w, Hanjun Guo,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 2017/6/26 21:12, John Garry wrote:
> 
>>>
>>> I saw Will has already sent the pull request. But, FWIW, we are seeing
>>> roughly the same performance as v1 patchset. For PCI NIC, Zhou again
>>> found performance drop goes from ~15->8% with SMMU enabled, and for
>>> integrated storage controller [platform device], we still see a drop of
>>> about 50%, depending on datarates (Leizhen has been working on fixing
>>> this).
>>
>> Thanks for confirming. Following Joerg's suggestion that the storage
>> workloads may still depend on rbtree performance - it had slipped my
>> mind that even with small block sizes those could well be grouped into
>> scatterlists large enough to trigger a >64-page IOVA allocation - I've
>> taken the liberty of cooking up a simplified version of Leizhen's rbtree
>> optimisation series in the iommu/iova branch of my tree. I'll follow up
>> on that after the merge window, but if anyone wants to play with it in
>> the meantime feel free.
The main problem is lock contention on the command queue. I have prepared
my patchset; I will send it later.

> 
> Just a reminder that we also saw poor performance with our integrated NIC on your v1 patchset (I can push for v2 patchset testing, but expect the same).
> 
> We might now be able to include an LSI 3108 PCI SAS card in our testing as well, to give a broader set of results.
> 
> John
> 
>>
>> Robin.
>>
>> .
>>
> 
> 
> 
> .
> 

-- 
Thanks!
Best Regards

^ permalink raw reply	[flat|nested] 44+ messages in thread
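
For readers unfamiliar with the bottleneck Leizhen describes: with the
io-pgtable lock gone, every TLB invalidation still has to be published
through the SMMUv3's single hardware command queue, so concurrent unmaps
now serialise on that queue's lock instead. A minimal sketch of the
pattern, using hypothetical names rather than the real driver's:

#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * Hypothetical sketch only - not the arm-smmu-v3 driver. It illustrates
 * why command-queue submission becomes the next point of contention:
 * every CPU issuing a TLBI/SYNC takes the same spinlock.
 */
struct cmdq_sketch {
	spinlock_t	lock;
	u64		*ring;		/* command queue ring buffer */
	u32		prod, cons;	/* producer/consumer indices */
};

static void cmdq_issue_sketch(struct cmdq_sketch *q, u64 cmd0, u64 cmd1)
{
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	/* wait for space, write cmd0/cmd1 at q->prod, advance q->prod */
	spin_unlock_irqrestore(&q->lock, flags);
}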

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-23 11:34                 ` Linu Cherian
@ 2017-06-27  5:11                   ` Linu Cherian
  -1 siblings, 0 replies; 44+ messages in thread
From: Linu Cherian @ 2017-06-27  5:11 UTC (permalink / raw)
  To: Robin Murphy
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA

On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > On 23/06/17 09:56, Linu Cherian wrote:
> > > On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> > >>
> > >> Robin,
> > >> Was trying to understand the new changes. Had few questions on 
> > >> arm_lpae_install_table. 
> > >>
> > >> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> > >>> For parallel I/O with multiple concurrent threads servicing the same
> > >>> device (or devices, if several share a domain), serialising page table
> > >>> updates becomes a massive bottleneck. On reflection, though, we don't
> > >>> strictly need to do that - for valid IOMMU API usage, there are in fact
> > >>> only two races that we need to guard against: multiple map requests for
> > >>> different blocks within the same region, when the intermediate-level
> > >>> table for that region does not yet exist; and multiple unmaps of
> > >>> different parts of the same block entry. Both of those are fairly easily
> > >>> solved by using a cmpxchg to install the new table, such that if we then
> > >>> find that someone else's table got there first, we can simply free ours
> > >>> and continue.
> > >>>
> > >>> Make the requisite changes such that we can withstand being called
> > >>> without the caller maintaining a lock. In theory, this opens up a few
> > >>> corners in which wildly misbehaving callers making nonsensical
> > >>> overlapping requests might lead to crashes instead of just unpredictable
> > >>> results, but correct code really does not deserve to pay a significant
> > >>> performance cost for the sake of masking bugs in theoretical broken code.
> > >>>
> > >>> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> > >>> ---
> > >>>
> > >>> v2:
> > >>>  - Fix barriers in install_table
> > >>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> > >>>  - Minor cosmetics
> > >>>
> > >>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> > >>>  1 file changed, 57 insertions(+), 15 deletions(-)
> > >>>
> > >>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > >>> index 6334f51912ea..52700fa958c2 100644
> > >>> --- a/drivers/iommu/io-pgtable-arm.c
> > >>> +++ b/drivers/iommu/io-pgtable-arm.c
> > >>> @@ -20,6 +20,7 @@
> > >>>  
> > >>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> > >>>  
> > >>> +#include <linux/atomic.h>
> > >>>  #include <linux/iommu.h>
> > >>>  #include <linux/kernel.h>
> > >>>  #include <linux/sizes.h>
> > >>> @@ -99,6 +100,8 @@
> > >>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> > >>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> > >>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> > >>> +/* Software bit for solving coherency races */
> > >>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> > >>>  
> > >>>  /* Stage-1 PTE */
> > >>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> > >>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> > >>>  	free_pages_exact(pages, size);
> > >>>  }
> > >>>  
> > >>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> > >>> +				struct io_pgtable_cfg *cfg)
> > >>> +{
> > >>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> > >>> +				   sizeof(*ptep), DMA_TO_DEVICE);
> > >>> +}
> > >>> +
> > >>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> > >>>  			       struct io_pgtable_cfg *cfg)
> > >>>  {
> > >>>  	*ptep = pte;
> > >>>  
> > >>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> > >>> -		dma_sync_single_for_device(cfg->iommu_dev,
> > >>> -					   __arm_lpae_dma_addr(ptep),
> > >>> -					   sizeof(pte), DMA_TO_DEVICE);
> > >>> +		__arm_lpae_sync_pte(ptep, cfg);
> > >>>  }
> > >>>  
> > >>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > >>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > >>>  
> > >>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> > >>>  					     arm_lpae_iopte *ptep,
> > >>> +					     arm_lpae_iopte curr,
> > >>>  					     struct io_pgtable_cfg *cfg)
> > >>>  {
> > >>> -	arm_lpae_iopte new;
> > >>> +	arm_lpae_iopte old, new;
> > >>>  
> > >>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> > >>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> > >>>  		new |= ARM_LPAE_PTE_NSTABLE;
> > >>>  
> > >>> -	__arm_lpae_set_pte(ptep, new, cfg);
> > >>> -	return new;
> > >>> +	/* Ensure the table itself is visible before its PTE can be */
> > >>> +	wmb();
> > >>
> > >> Could you please give more hints on why this is required.
> > 
> > In theory it's possible for the write to ptep to become visible before
> > the previous writes filling out the PTEs in table - if the IOMMU
> > happened to speculatively walk ptep while parts of table were still sat
> > in a write buffer somewhere, it could see the old contents of that page
> > and potentially allocate bogus TLB entries if the stale data happened to
> > look like valid PTEs. In the non-coherent case the DMA cache maintenance
> > is sufficient, but otherwise we need a barrier to order writing the
> > next-level PTEs strictly before writing the table PTE pointing to them,
> > such that the IOMMU cannot at any point see a partially-initialised table.
> 
> Got it. Thanks a lot for explainig that.
> 
> > 
> > >>> +
> > >>> +	old = cmpxchg64_relaxed(ptep, curr, new);
> > >>> +
> > >>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> > >>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> > >>> +		return old;
> > >>> +
> > >>> +	/* Even if it's not ours, there's no point waiting; just kick it */
> > >>> +	__arm_lpae_sync_pte(ptep, cfg);
> > >>> +	if (old == curr)
> > >>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> > >>
> > >> How is it ensured that the cache operations are completed before we flag them with
> > >> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> > 
> > The wmb() here was a silly oversight on my part - as Will reminded me,
> > dma_sync_*() already has its own barrier to ensure completion, which is
> > pretty obvious in retrospect because the entire streaming DMA would be
> > totally broken otherwise.
> > 
> > >>> +
> > >>> +	return old;
> > >>>  }
> > >>>
> > >>
> > > 
> > > 
> > > Observed a performance drop of close to 1G, 
> > > while testing on the 10G network interface with this patch series compared to v1.
> > 
> > Is that consistent over multiple runs? I wouldn't expect many workloads
> > to be thrashing table and hugepage mappings in the same IOVA region, so
> > after a point we should tend to reach a fairly steady state where we're
> > only changing leaf PTEs.
> > 
> > > Moving the wmb() as in v1 restores it back.
> > 
> > Note that on a coherent platform like ThunderX that's as good as just
> > deleting it, because you'll never execute the case below. However, on
> > reflection I think it can at least safely be downgraded to dma_wmb()
> > (i.e. DMB) rather than a full DSB - would you be able to test what
> > difference that makes?
> 
> The testing was done on Thunderx 1, which has a non coherent page table walk.
> Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> 
> 
> 

Should I send a patch for this?

Thanks.



-- 
Linu cherian

^ permalink raw reply	[flat|nested] 44+ messages in thread
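
Putting the quoted hunks back together, the lockless install path being
discussed looks roughly as follows; this is a consolidated sketch of the
v2 patch with the thread's reasoning folded in as comments, not the final
merged code (which, per the follow-ups below, relaxes the wmb() to
dma_wmb()):

static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
					     arm_lpae_iopte *ptep,
					     arm_lpae_iopte curr,
					     struct io_pgtable_cfg *cfg)
{
	arm_lpae_iopte old, new;

	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
		new |= ARM_LPAE_PTE_NSTABLE;

	/*
	 * Order the writes that filled out the new table before the write
	 * that publishes it, so a speculative walk through ptep can never
	 * see a partially-initialised table.
	 */
	wmb();

	/* Only one racing caller wins; the losers see someone else's table */
	old = cmpxchg64_relaxed(ptep, curr, new);

	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
	    (old & ARM_LPAE_PTE_SW_SYNC))
		return old;

	/*
	 * Non-coherent walker and nobody has flushed this PTE yet: clean it
	 * to the point of coherency (dma_sync_* includes its own completion
	 * barrier), then mark it as synced so later racers can skip this.
	 */
	__arm_lpae_sync_pte(ptep, cfg);
	if (old == curr)
		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

	return old;
}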

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-27  5:11                   ` Linu Cherian
@ 2017-06-27  8:39                     ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-27  8:39 UTC (permalink / raw)
  To: Linu Cherian
  Cc: sunil.goutham-YGCgFSpz5w/QT0dZR+AlfA,
	wangzhou1-C8/M+/jPZTeaMJb+Lgu22Q,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA

On Tue, Jun 27, 2017 at 10:41:55AM +0530, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > > Note that on a coherent platform like ThunderX that's as good as just
> > > deleting it, because you'll never execute the case below. However, on
> > > reflection I think it can at least safely be downgraded to dma_wmb()
> > > (i.e. DMB) rather than a full DSB - would you be able to test what
> > > difference that makes?
> > 
> > The testing was done on Thunderx 1, which has a non coherent page table walk.
> > Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> > 
> 
> Should I send a patch for this?

I already did it:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=for-joerg/arm-smmu/updates&id=77f3445866c39d8866b31d8d9fa47c7c20938e05

Will

^ permalink raw reply	[flat|nested] 44+ messages in thread
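
In terms of the code quoted earlier in the thread, the change amounts to
relaxing the publish barrier from a full DSB to a DMB, as discussed above;
a sketch of the relevant hunk against the quoted v2 code (the merged
commit may differ cosmetically):

 	/* Ensure the table itself is visible before its PTE can be */
-	wmb();
+	dma_wmb();
 
 	old = cmpxchg64_relaxed(ptep, curr, new);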

* Re: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
  2017-06-27  8:39                     ` Will Deacon
@ 2017-06-27  9:08                       ` Linu Cherian
  -1 siblings, 0 replies; 44+ messages in thread
From: Linu Cherian @ 2017-06-27  9:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: salil.mehta, sunil.goutham, thunder.leizhen, joro, john.garry,
	iommu, wangzhou1, ray.jui, nwatters, Linu Cherian, Robin Murphy,
	linux-arm-kernel

On Tue Jun 27, 2017 at 09:39:13AM +0100, Will Deacon wrote:
> On Tue, Jun 27, 2017 at 10:41:55AM +0530, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> > > On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > > > Note that on a coherent platform like ThunderX that's as good as just
> > > > deleting it, because you'll never execute the case below. However, on
> > > > reflection I think it can at least safely be downgraded to dma_wmb()
> > > > (i.e. DMB) rather than a full DSB - would you be able to test what
> > > > difference that makes?
> > > 
> > > The testing was done on Thunderx 1, which has a non coherent page table walk.
> > > Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> > > 
> > 
> > Should I send a patch for this?
> 
> I already did it:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=for-joerg/arm-smmu/updates&id=77f3445866c39d8866b31d8d9fa47c7c20938e05
> 
> Will
> 

Thanks Will.

-- 
Linu cherian

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread (newest message: 2017-06-27  9:08 UTC)

Thread overview: 44+ messages
2017-06-22 15:53 [PATCH v2 0/8] io-pgtable lock removal Robin Murphy
2017-06-22 15:53 ` Robin Murphy
     [not found] ` <cover.1498145008.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2017-06-22 15:53   ` [PATCH v2 1/8] iommu/io-pgtable-arm-v7s: Check table PTEs more precisely Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 2/8] iommu/io-pgtable-arm: Improve split_blk_unmap Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 3/8] iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 4/8] iommu/io-pgtable: Introduce explicit coherency Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation Robin Murphy
2017-06-22 15:53     ` Robin Murphy
     [not found]     ` <af61929e3984a31588cab08ed84a237602b5263e.1498145008.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2017-06-23  5:53       ` Linu Cherian
2017-06-23  5:53         ` Linu Cherian
2017-06-23  8:56         ` Linu Cherian
2017-06-23  8:56           ` Linu Cherian
2017-06-23 10:35           ` Robin Murphy
2017-06-23 10:35             ` Robin Murphy
     [not found]             ` <7a8da378-19e2-2f35-877a-cc6ce389301a-5wv7dgnIgG8@public.gmane.org>
2017-06-23 11:34               ` Linu Cherian
2017-06-23 11:34                 ` Linu Cherian
2017-06-27  5:11                 ` Linu Cherian
2017-06-27  5:11                   ` Linu Cherian
2017-06-27  8:39                   ` Will Deacon
2017-06-27  8:39                     ` Will Deacon
2017-06-27  9:08                     ` Linu Cherian
2017-06-27  9:08                       ` Linu Cherian
2017-06-22 15:53   ` [PATCH v2 6/8] iommu/io-pgtable-arm-v7s: " Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 7/8] iommu/arm-smmu: Remove io-pgtable spinlock Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-22 15:53   ` [PATCH v2 8/8] iommu/arm-smmu-v3: " Robin Murphy
2017-06-22 15:53     ` Robin Murphy
2017-06-23  8:47   ` [PATCH v2 0/8] io-pgtable lock removal John Garry
2017-06-23  8:47     ` John Garry
     [not found]     ` <61b7b953-5bf4-eb45-c3e8-b4491e8fdca7-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2017-06-23  9:58       ` Robin Murphy
2017-06-23  9:58         ` Robin Murphy
2017-06-26 11:35         ` John Garry
2017-06-26 11:35           ` John Garry
     [not found]           ` <9bbf18c7-34ba-6e94-53bd-3f75059c1bb2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2017-06-26 12:31             ` Robin Murphy
2017-06-26 12:31               ` Robin Murphy
2017-06-26 13:12               ` John Garry
2017-06-26 13:12                 ` John Garry
     [not found]                 ` <15e7ce0a-bf4b-cc77-3600-c37ed865a4d7-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2017-06-26 13:19                   ` Leizhen (ThunderTown)
2017-06-26 13:19                     ` Leizhen (ThunderTown)
