* [PATCH v2 0/5] Optimize iommu_map_sg() performance
@ 2021-01-11 14:54 ` Isaac J. Manjarres
  0 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

The iommu_map_sg() code currently iterates through the given
scatter-gather list, and in the worst case, invokes iommu_map()
for each element in the scatter-gather list, which calls into
the IOMMU driver through an indirect call. For an IOMMU driver
that uses a format supported by the io-pgtable code, the IOMMU
driver will then call into the io-pgtable code to map the chunk.

Jumping between the IOMMU core code, the IOMMU driver, and the
io-pgtable code and back for each element in a scatter-gather list
is not efficient.

Instead, add a map_sg() hook in both the IOMMU driver ops and the
io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
map_sg() hook with the entire scatter-gather list, which can call
into the io-pgtable map_sg() hook, which can process the entire
scatter-gather list, significantly reducing the number of indirect
calls and jumps between these layers, boosting performance.

On a system that uses the ARM SMMU driver and the ARM LPAE format,
the current implementation of iommu_map_sg() yields the following
latencies for mapping scatter-gather lists of various sizes. These
latencies are calculated by repeating the mapping operation 10 times:

    size        iommu_map_sg latency
      4K            0.624 us
     64K            9.468 us
      1M          122.557 us
      2M          239.807 us
     12M         1435.979 us
     24M         2884.968 us
     32M         3832.979 us

On the same system, the proposed modifications yield the following
results:

    size        iommu_map_sg latency
      4K            3.645 us
     64K            4.198 us
      1M           11.010 us
      2M           17.125 us
     12M           82.416 us
     24M          158.677 us
     32M          210.468 us

The procedure for collecting the iommu_map_sg latencies is
the same in both experiments. Clearly, reducing the jumps
between the different layers in the IOMMU code offers a
significant improvement in iommu_map_sg() latency.
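
For reference, the measurement loop is roughly of the following shape.
This is only a minimal sketch: the scatterlist setup, the domain
attachment, and the exact timing/reporting details are assumptions
rather than the exact test harness used for the numbers above.

#include <linux/iommu.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/scatterlist.h>

/*
 * Hypothetical helper: times 10 back-to-back iommu_map_sg() calls on a
 * pre-built sg_table and prints the average latency. 'domain' is assumed
 * to be already attached; 'total_size' is the total buffer size, used to
 * unmap between iterations so each run starts from a clean IOVA range.
 */
static void time_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
			      struct sg_table *sgt, size_t total_size)
{
	ktime_t start, total = 0;
	int i;

	for (i = 0; i < 10; i++) {
		start = ktime_get();
		iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents,
			     IOMMU_READ | IOMMU_WRITE);
		total = ktime_add(total, ktime_sub(ktime_get(), start));

		/* Tear the mapping down so the next iteration starts clean. */
		iommu_unmap(domain, iova, total_size);
	}

	pr_info("avg iommu_map_sg() latency: %lld ns\n",
		div_s64(ktime_to_ns(total), 10));
}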

Changes since v1:

-Fixed an off-by-one error in arm_[lpae/v7s]_map_by_pgsize
when checking if the IOVA and physical address ranges being
mapped are within the appropriate limits.
-Added Sai Prakash Ranjan's "Tested-by" tag.

Thanks,
Isaac

Isaac J. Manjarres (5):
  iommu/io-pgtable: Introduce map_sg() as a page table op
  iommu/io-pgtable-arm: Hook up map_sg()
  iommu/io-pgtable-arm-v7s: Hook up map_sg()
  iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
  iommu/arm-smmu: Hook up map_sg()

 drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 ++++++++
 drivers/iommu/io-pgtable-arm-v7s.c    | 90 +++++++++++++++++++++++++++++++++++
 drivers/iommu/io-pgtable-arm.c        | 86 +++++++++++++++++++++++++++++++++
 drivers/iommu/iommu.c                 | 25 ++++++++--
 include/linux/io-pgtable.h            |  6 +++
 include/linux/iommu.h                 | 13 +++++
 6 files changed, 234 insertions(+), 5 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/5] iommu/io-pgtable: Introduce map_sg() as a page table op
  2021-01-11 14:54 ` Isaac J. Manjarres
@ 2021-01-11 14:54   ` Isaac J. Manjarres
  -1 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

While mapping a scatter-gather list, iommu_map_sg() calls
into the IOMMU driver through an indirect call, which can
call into the io-pgtable code through another indirect call.

In the worst case, this sequence of going through the IOMMU core
code, the IOMMU driver, and finally the io-pgtable code occurs for
every element in the scatter-gather list, which is not efficient.

Introduce a map_sg() callback in the io-pgtable ops so that IOMMU
drivers can invoke it with the complete scatter-gather list, allowing
the list to be processed entirely within the io-pgtable code. This
reduces the number of indirect calls and boosts overall
iommu_map_sg() performance.

Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 include/linux/io-pgtable.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb..6d0e731 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -147,6 +147,9 @@ struct io_pgtable_cfg {
  * struct io_pgtable_ops - Page table manipulation API for IOMMU drivers.
  *
  * @map:          Map a physically contiguous memory region.
+ * @map_sg:       Map a scatter-gather list of physically contiguous memory
+ *                chunks. The mapped pointer argument is used to store how
+ *                many bytes are mapped.
  * @unmap:        Unmap a physically contiguous memory region.
  * @iova_to_phys: Translate iova to physical address.
  *
@@ -156,6 +159,9 @@ struct io_pgtable_cfg {
 struct io_pgtable_ops {
 	int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+	int (*map_sg)(struct io_pgtable_ops *ops, unsigned long iova,
+		      struct scatterlist *sg, unsigned int nents, int prot,
+		      gfp_t gfp, size_t *mapped);
 	size_t (*unmap)(struct io_pgtable_ops *ops, unsigned long iova,
 			size_t size, struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/5] iommu/io-pgtable-arm: Hook up map_sg()
  2021-01-11 14:54 ` Isaac J. Manjarres
@ 2021-01-11 14:54   ` Isaac J. Manjarres
  -1 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

Implement the map_sg io-pgtable op for the ARM LPAE io-pgtable
code, so that IOMMU drivers can call it when they need to map
a scatter-gather list.

Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/io-pgtable-arm.c | 86 ++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu.c          | 12 +++---
 include/linux/iommu.h          |  8 ++++
 3 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58..0c11529 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -473,6 +473,91 @@ static int arm_lpae_map(struct io_pgtable_ops *ops, unsigned long iova,
 	return ret;
 }
 
+static int arm_lpae_map_by_pgsize(struct io_pgtable_ops *ops,
+				  unsigned long iova, phys_addr_t paddr,
+				  size_t size, int iommu_prot, gfp_t gfp,
+				  size_t *mapped)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int ret, lvl = data->start_level;
+	arm_lpae_iopte prot = arm_lpae_prot_to_pte(data, iommu_prot);
+	unsigned int min_pagesz = 1 << __ffs(cfg->pgsize_bitmap);
+	long iaext = (s64)(iova + size - 1) >> cfg->ias;
+	size_t pgsize;
+
+	if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
+		       iova, &paddr, size, min_pagesz);
+		return -EINVAL;
+	}
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext || (paddr + size - 1) >> cfg->oas))
+		return -ERANGE;
+
+	while (size) {
+		pgsize = iommu_pgsize(cfg->pgsize_bitmap, iova | paddr, size);
+		ret = __arm_lpae_map(data, iova, paddr, pgsize, prot, lvl, ptep,
+				     gfp);
+		if (ret)
+			return ret;
+
+		iova += pgsize;
+		paddr += pgsize;
+		*mapped += pgsize;
+		size -= pgsize;
+	}
+
+	return 0;
+}
+
+static int arm_lpae_map_sg(struct io_pgtable_ops *ops, unsigned long iova,
+			   struct scatterlist *sg, unsigned int nents,
+			   int iommu_prot, gfp_t gfp, size_t *mapped)
+{
+
+	size_t len = 0;
+	unsigned int i = 0;
+	int ret;
+	phys_addr_t start;
+
+	*mapped = 0;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return 0;
+
+	while (i <= nents) {
+		phys_addr_t s_phys = sg_phys(sg);
+
+		if (len && s_phys != start + len) {
+			ret = arm_lpae_map_by_pgsize(ops, iova + *mapped, start,
+						     len, iommu_prot, gfp,
+						     mapped);
+
+			if (ret)
+				return ret;
+
+			len = 0;
+		}
+
+		if (len) {
+			len += sg->length;
+		} else {
+			len = sg->length;
+			start = s_phys;
+		}
+
+		if (++i < nents)
+			sg = sg_next(sg);
+	}
+
+	return 0;
+}
+
 static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
 				    arm_lpae_iopte *ptep)
 {
@@ -750,6 +835,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 
 	data->iop.ops = (struct io_pgtable_ops) {
 		.map		= arm_lpae_map,
+		.map_sg		= arm_lpae_map_sg,
 		.unmap		= arm_lpae_unmap,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 	};
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ffeebda..0da0687 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2346,8 +2346,8 @@ phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 }
 EXPORT_SYMBOL_GPL(iommu_iova_to_phys);
 
-static size_t iommu_pgsize(struct iommu_domain *domain,
-			   unsigned long addr_merge, size_t size)
+size_t iommu_pgsize(unsigned long pgsize_bitmap, unsigned long addr_merge,
+		    size_t size)
 {
 	unsigned int pgsize_idx;
 	size_t pgsize;
@@ -2366,7 +2366,7 @@ static size_t iommu_pgsize(struct iommu_domain *domain,
 	pgsize = (1UL << (pgsize_idx + 1)) - 1;
 
 	/* throw away page sizes not supported by the hardware */
-	pgsize &= domain->pgsize_bitmap;
+	pgsize &= pgsize_bitmap;
 
 	/* make sure we're still sane */
 	BUG_ON(!pgsize);
@@ -2412,7 +2412,8 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	pr_debug("map: iova 0x%lx pa %pa size 0x%zx\n", iova, &paddr, size);
 
 	while (size) {
-		size_t pgsize = iommu_pgsize(domain, iova | paddr, size);
+		size_t pgsize = iommu_pgsize(domain->pgsize_bitmap,
+					     iova | paddr, size);
 
 		pr_debug("mapping: iova 0x%lx pa %pa pgsize 0x%zx\n",
 			 iova, &paddr, pgsize);
@@ -2490,7 +2491,8 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 	 * or we hit an area that isn't mapped.
 	 */
 	while (unmapped < size) {
-		size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
+		size_t pgsize = iommu_pgsize(domain->pgsize_bitmap, iova,
+					     size - unmapped);
 
 		unmapped_page = ops->unmap(domain, iova, pgsize, iotlb_gather);
 		if (!unmapped_page)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b3f0e20..0e40a38 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -438,6 +438,8 @@ extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
 				   struct device *dev, ioasid_t pasid);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
+extern size_t iommu_pgsize(unsigned long pgsize_bitmap, unsigned long addr_mer,
+			   size_t size);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
 extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
@@ -690,6 +692,12 @@ static inline struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
 	return NULL;
 }
 
+static inline size_t iommu_pgsize(unsigned long pgsize_bitmap,
+				  unsigned long addr_merge, size_t size)
+{
+	return 0;
+}
+
 static inline int iommu_map(struct iommu_domain *domain, unsigned long iova,
 			    phys_addr_t paddr, size_t size, int prot)
 {
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/5] iommu/io-pgtable-arm-v7s: Hook up map_sg()
  2021-01-11 14:54 ` Isaac J. Manjarres
@ 2021-01-11 14:54   ` Isaac J. Manjarres
  -1 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

Implement the map_sg io-pgtable op for the ARMv7s io-pgtable
code, so that IOMMU drivers can call it when they need to map
a scatter-gather list.

Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 90 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 1d92ac9..8665dab 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -545,6 +545,95 @@ static int arm_v7s_map(struct io_pgtable_ops *ops, unsigned long iova,
 	return ret;
 }
 
+static int arm_v7s_map_by_pgsize(struct io_pgtable_ops *ops,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t size, int prot, gfp_t gfp,
+				 size_t *mapped)
+{
+	struct arm_v7s_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable *iop = &data->iop;
+	struct io_pgtable_cfg *cfg = &iop->cfg;
+	unsigned int min_pagesz = 1 << __ffs(cfg->pgsize_bitmap);
+	int ret;
+	size_t pgsize;
+
+	if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+		pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
+		       iova, &paddr, size, min_pagesz);
+		return -EINVAL;
+	}
+
+	if (WARN_ON((iova + size - 1) >= (1ULL << cfg->ias) ||
+		    (paddr + size - 1) >= (1ULL << cfg->oas)))
+		return -ERANGE;
+
+	while (size) {
+		pgsize = iommu_pgsize(cfg->pgsize_bitmap, iova | paddr, size);
+		ret = __arm_v7s_map(data, iova, paddr, pgsize, prot, 1,
+				    data->pgd, gfp);
+
+		if (iop->cfg.quirks & IO_PGTABLE_QUIRK_TLBI_ON_MAP) {
+			io_pgtable_tlb_flush_walk(&data->iop, iova, size,
+						  ARM_V7S_BLOCK_SIZE(2));
+		} else {
+			wmb();
+		}
+
+		if (ret)
+			return ret;
+
+		iova += pgsize;
+		paddr += pgsize;
+		*mapped += pgsize;
+		size -= pgsize;
+	}
+
+	return 0;
+}
+
+static int arm_v7s_map_sg(struct io_pgtable_ops *ops, unsigned long iova,
+			  struct scatterlist *sg, unsigned int nents,
+			  int iommu_prot, gfp_t gfp, size_t *mapped)
+{
+	size_t len = 0;
+	unsigned int i = 0;
+	int ret;
+	phys_addr_t start;
+
+	*mapped = 0;
+
+	/* If no access, then nothing to do */
+	if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+		return 0;
+
+	while (i <= nents) {
+		phys_addr_t s_phys = sg_phys(sg);
+
+		if (len && s_phys != start + len) {
+			ret = arm_v7s_map_by_pgsize(ops, iova + *mapped, start,
+						    len, iommu_prot, gfp,
+						    mapped);
+
+			if (ret)
+				return ret;
+
+			len = 0;
+		}
+
+		if (len) {
+			len += sg->length;
+		} else {
+			len = sg->length;
+			start = s_phys;
+		}
+
+		if (++i < nents)
+			sg = sg_next(sg);
+	}
+
+	return 0;
+}
+
 static void arm_v7s_free_pgtable(struct io_pgtable *iop)
 {
 	struct arm_v7s_io_pgtable *data = io_pgtable_to_data(iop);
@@ -783,6 +872,7 @@ static struct io_pgtable *arm_v7s_alloc_pgtable(struct io_pgtable_cfg *cfg,
 
 	data->iop.ops = (struct io_pgtable_ops) {
 		.map		= arm_v7s_map,
+		.map_sg		= arm_v7s_map_sg,
 		.unmap		= arm_v7s_unmap,
 		.iova_to_phys	= arm_v7s_iova_to_phys,
 	};
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/5] iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
  2021-01-11 14:54 ` Isaac J. Manjarres
@ 2021-01-11 14:54   ` Isaac J. Manjarres
  -1 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

Add support for IOMMU drivers to have their own map_sg() callbacks.
This completes the path for having iommu_map_sg() invoke an IOMMU
driver's map_sg() callback, which can then invoke the io-pgtable
map_sg() callback with the entire scatter-gather list, so that it
can be processed entirely in the io-pgtable layer.

For IOMMU drivers that do not provide a map_sg() callback, the
default implementation, which iterates through the scatter-gather
list and calls iommu_map() for each element, will be used.

Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/iommu.c | 13 +++++++++++++
 include/linux/iommu.h |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0da0687..46acd5c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2535,11 +2535,24 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			     struct scatterlist *sg, unsigned int nents, int prot,
 			     gfp_t gfp)
 {
+	const struct iommu_ops *ops = domain->ops;
 	size_t len = 0, mapped = 0;
 	phys_addr_t start;
 	unsigned int i = 0;
 	int ret;
 
+	if (ops->map_sg) {
+		ret = ops->map_sg(domain, iova, sg, nents, prot, gfp, &mapped);
+
+		if (ops->iotlb_sync_map)
+			ops->iotlb_sync_map(domain);
+
+		if (ret)
+			goto out_err;
+
+		return mapped;
+	}
+
 	while (i <= nents) {
 		phys_addr_t s_phys = sg_phys(sg);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0e40a38..bac7681 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -192,6 +192,8 @@ struct iommu_iotlb_gather {
  * @attach_dev: attach device to an iommu domain
  * @detach_dev: detach device from an iommu domain
  * @map: map a physically contiguous memory region to an iommu domain
+ * @map_sg: map a scatter-gather list of physically contiguous chunks to
+ *          an iommu domain.
  * @unmap: unmap a physically contiguous memory region from an iommu domain
  * @flush_iotlb_all: Synchronously flush all hardware TLBs for this domain
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
@@ -243,6 +245,9 @@ struct iommu_ops {
 	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
 	int (*map)(struct iommu_domain *domain, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+	int (*map_sg)(struct iommu_domain *domain, unsigned long iova,
+		      struct scatterlist *sg, unsigned int nents, int prot,
+		      gfp_t gfp, size_t *mapped);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
 		     size_t size, struct iommu_iotlb_gather *iotlb_gather);
 	void (*flush_iotlb_all)(struct iommu_domain *domain);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/5] iommu/arm-smmu: Hook up map_sg()
  2021-01-11 14:54 ` Isaac J. Manjarres
@ 2021-01-11 14:54   ` Isaac J. Manjarres
  -1 siblings, 0 replies; 26+ messages in thread
From: Isaac J. Manjarres @ 2021-01-11 14:54 UTC (permalink / raw)
  To: will, robin.murphy, joro
  Cc: Isaac J. Manjarres, pdaly, pratikp, linux-arm-kernel, iommu,
	linux-kernel

Now that everything is in place for iommu_map_sg() to defer
mapping a scatter-gather list to the io-pgtable layer, implement
the map_sg() callback in the SMMU driver, so that iommu_map_sg()
can invoke it with the entire scatter-gather list that will be
mapped.

Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfd..52acc68 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1208,6 +1208,24 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
 	return ret;
 }
 
+static int arm_smmu_map_sg(struct iommu_domain *domain, unsigned long iova,
+			   struct scatterlist *sg, unsigned int nents, int prot,
+			   gfp_t gfp, size_t *mapped)
+{
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+	int ret;
+
+	if (!ops)
+		return -ENODEV;
+
+	arm_smmu_rpm_get(smmu);
+	ret = ops->map_sg(ops, iova, sg, nents, prot, gfp, mapped);
+	arm_smmu_rpm_put(smmu);
+
+	return ret;
+}
+
 static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
 			     size_t size, struct iommu_iotlb_gather *gather)
 {
@@ -1624,6 +1642,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.domain_free		= arm_smmu_domain_free,
 	.attach_dev		= arm_smmu_attach_dev,
 	.map			= arm_smmu_map,
+	.map_sg			= arm_smmu_map_sg,
 	.unmap			= arm_smmu_unmap,
 	.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 	.iotlb_sync		= arm_smmu_iotlb_sync,
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] Optimize iommu_map_sg() performance
  2021-01-11 14:54 ` Isaac J. Manjarres
  (?)
@ 2021-01-12 16:00   ` Robin Murphy
  -1 siblings, 0 replies; 26+ messages in thread
From: Robin Murphy @ 2021-01-12 16:00 UTC (permalink / raw)
  To: Isaac J. Manjarres, will, joro
  Cc: pdaly, pratikp, linux-arm-kernel, iommu, linux-kernel

On 2021-01-11 14:54, Isaac J. Manjarres wrote:
> The iommu_map_sg() code currently iterates through the given
> scatter-gather list, and in the worst case, invokes iommu_map()
> for each element in the scatter-gather list, which calls into
> the IOMMU driver through an indirect call. For an IOMMU driver
> that uses a format supported by the io-pgtable code, the IOMMU
> driver will then call into the io-pgtable code to map the chunk.
> 
> Jumping between the IOMMU core code, the IOMMU driver, and the
> io-pgtable code and back for each element in a scatter-gather list
> is not efficient.
> 
> Instead, add a map_sg() hook in both the IOMMU driver ops and the
> io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
> map_sg() hook with the entire scatter-gather list, which can call
> into the io-pgtable map_sg() hook, which can process the entire
> scatter-gather list, significantly reducing the number of indirect
> calls and jumps between these layers, boosting performance.

Out of curiosity, how much of the difference is attributable to actual 
indirect call overhead vs. the additional massive reduction in visits to 
arm_smmu_rpm_{get,put} that you fail to mention? There are ways to 
optimise indirect calling that would benefit *all* cases, rather than 
just one operation for one particular driver.

> On a system that uses the ARM SMMU driver and the ARM LPAE format,
> the current implementation of iommu_map_sg() yields the following
> latencies for mapping scatter-gather lists of various sizes. These
> latencies are calculated by repeating the mapping operation 10 times:
> 
>      size        iommu_map_sg latency
>        4K            0.624 us
>       64K            9.468 us
>        1M          122.557 us
>        2M          239.807 us
>       12M         1435.979 us
>       24M         2884.968 us
>       32M         3832.979 us
> 
> On the same system, the proposed modifications yield the following
> results:
> 
>      size        iommu_map_sg latency
>        4K            3.645 us
>       64K            4.198 us
>        1M           11.010 us
>        2M           17.125 us
>       12M           82.416 us
>       24M          158.677 us
>       32M          210.468 us
> 
> The procedure for collecting the iommu_map_sg latencies is
> the same in both experiments. Clearly, reducing the jumps
> between the different layers in the IOMMU code offers a
> significant improvement in iommu_map_sg() latency.

Presumably those are deliberately worst-case numbers? After all, a 32MB 
scatterlist *could* incur less overhead than a 64KB one if things line 
up just right (still 16 ->map calls, but each with one fewer level of 
pagetable to traverse). TBH I find the significant regression of the 4KB 
case the most interesting - what's going on there?

My main reservation here is that we get an explosion of duplicate copies 
of almost the same code, and it's code that's just non-trivial enough to 
start being bug-prone. And it's all still only for one specific 
operation - your argument about calling through multiple layers for each 
element applies just as much to iommu_map() itself, so why aren't we 
trying to make more fundamental improvements with wider benefits? Indeed 
I can't imagine the existing iommu_map_sg() loop really adds significant 
overhead compared to a single iommu_map() call that results in the 
equivalent set of ->map calls to the driver.

At a glance, I reckon that simply extending the internal ->map and 
->unmap interfaces to encode a number of consecutive identical pages 
would already get us a large chunk of the way there; then we'd be in a 
better place to consider options for the io-pgtable interface.
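
E.g. something along these lines, purely as a sketch of the kind of
signature I have in mind rather than a concrete proposal:

	int (*map_pages)(struct iommu_domain *domain, unsigned long iova,
			 phys_addr_t paddr, size_t pgsize, size_t pgcount,
			 int prot, gfp_t gfp, size_t *mapped);

where the driver (and in turn io-pgtable) installs pgcount consecutive
entries of size pgsize in a single call and reports back how many bytes
it actually managed to map.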

Robin.

> Changes since v1:
> 
> -Fixed an off-by-one error in arm_[lpae/v7s]_map_by_pgsize
> when checking if the IOVA and physical address ranges being
> mapped are within the appropriate limits.
> -Added Sai Prakash Ranjan's "Tested-by" tag.
> 
> Thanks,
> Isaac
> 
> Isaac J. Manjarres (5):
>    iommu/io-pgtable: Introduce map_sg() as a page table op
>    iommu/io-pgtable-arm: Hook up map_sg()
>    iommu/io-pgtable-arm-v7s: Hook up map_sg()
>    iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
>    iommu/arm-smmu: Hook up map_sg()
> 
>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 ++++++++
>   drivers/iommu/io-pgtable-arm-v7s.c    | 90 +++++++++++++++++++++++++++++++++++
>   drivers/iommu/io-pgtable-arm.c        | 86 +++++++++++++++++++++++++++++++++
>   drivers/iommu/iommu.c                 | 25 ++++++++--
>   include/linux/io-pgtable.h            |  6 +++
>   include/linux/iommu.h                 | 13 +++++
>   6 files changed, 234 insertions(+), 5 deletions(-)
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] Optimize iommu_map_sg() performance
  2021-01-12 16:00   ` Robin Murphy
  (?)
@ 2021-01-12 16:33     ` Christoph Hellwig
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-01-12 16:33 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Isaac J. Manjarres, will, joro, pdaly, pratikp, linux-arm-kernel,
	iommu, linux-kernel

On Tue, Jan 12, 2021 at 04:00:59PM +0000, Robin Murphy wrote:
> Out of curiosity, how much of the difference is attributable to actual
> indirect call overhead vs. the additional massive reduction in visits to
> arm_smmu_rpm_{get,put} that you fail to mention? There are ways to optimise
> indirect calling that would benefit *all* cases, rather than just one
> operation for one particular driver.

Do we have systems that use different iommu_ops at the same time?
If not this would be a prime candidate for static call optimizations.
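
Roughly this kind of thing, I guess (untested, names made up purely
for illustration):

    #include <linux/static_call.h>

    /* start out pointing at a helper that just does the indirect call */
    DEFINE_STATIC_CALL(iommu_map_call, __iommu_map_fallback);

    /* if only one set of ops ever gets registered, patch the call
     * site to go straight to that driver's ->map, e.g. from
     * iommu_device_register():
     */
    static_call_update(iommu_map_call, iommu->ops->map);

    /* the hot path then becomes a patched direct call: */
    ret = static_call(iommu_map_call)(domain, iova, paddr, size, prot,
                                      GFP_KERNEL);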

Also I've been pondering adding direct calls to the iommu dma ops like
we do for DMA direct.  This would allow us to stop using dma_ops entirely
for arm64.
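
i.e. much like the shortcut kernel/dma/mapping.c already has for the
direct case - something like the below, where the iommu-dma check and
exported helper are hypothetical:

    /* in dma_map_page_attrs() and friends, alongside the existing
     * direct-mapping shortcut: */
    if (dma_map_direct(dev, ops))
        addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
    else if (dev_uses_iommu_dma(dev))    /* hypothetical check */
        addr = iommu_dma_map_page(dev, page, offset, size, dir, attrs);
    else
        addr = ops->map_page(dev, page, offset, size, dir, attrs);

with the last branch hopefully compiling away on arm64 once nothing
sets dma_ops any more.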

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] Optimize iommu_map_sg() performance
  2021-01-12 16:33     ` Christoph Hellwig
  (?)
@ 2021-01-13  2:54       ` Robin Murphy
  -1 siblings, 0 replies; 26+ messages in thread
From: Robin Murphy @ 2021-01-13  2:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Isaac J. Manjarres, will, joro, pdaly, pratikp, linux-arm-kernel,
	iommu, linux-kernel

On 2021-01-12 16:33, Christoph Hellwig wrote:
> On Tue, Jan 12, 2021 at 04:00:59PM +0000, Robin Murphy wrote:
>> Out of curiosity, how much of the difference is attributable to actual
>> indirect call overhead vs. the additional massive reduction in visits to
>> arm_smmu_rpm_{get,put} that you fail to mention? There are ways to optimise
>> indirect calling that would benefit *all* cases, rather than just one
>> operation for one particular driver.
> 
> Do we have systems that use different iommu_ops at the same time?
> If not this would be a prime candidate for static call optimizations.

They're not at all common, but such systems do technically exist. It's 
hard to make them work in the current "one set of ops per bus" model, 
but I still have a long-term dream of sorting that out so such setups 
*can* be supported properly. I certainly wouldn't want to make any 
changes that completely close the door on that idea, but any static call 
optimisation that can be done in a config-gated manner should be viable 
for x86 at least. Even better if we could do it with a dynamic 
branch-patching solution that keeps the indirect case as a fallback; 
AFAICS that should be feasible to eagerly apply somewhere around 
iommu_device_register(), then just undo again if another driver ever 
does show up registering a new set of ops that don't match. I'm pretty 
confident that the systems where performance matters most are going to 
be sensible homogeneous ones - on the Arm side the SBSA should see to 
that. The weird mix-and-match cases are typically going to be FPGA 
prototyping systems and esoteric embedded stuff that are worlds away 
from worrying about keeping up with line rate on a 40GbE NIC...
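
Conceptually something like the below, say - pure illustration with
made-up names, and obviously the direct target would have to be
config-gated per driver:

    static const struct iommu_ops *iommu_registered_ops;
    DEFINE_STATIC_KEY_FALSE(iommu_direct_map);

    /* around iommu_device_register(): */
    if (!iommu_registered_ops) {
        iommu_registered_ops = iommu->ops;
#ifdef CONFIG_ARM_SMMU
        if (iommu->ops == &arm_smmu_ops)
            static_branch_enable(&iommu_direct_map);
#endif
    } else if (iommu->ops != iommu_registered_ops) {
        /* another driver showed up with different ops - undo */
        static_branch_disable(&iommu_direct_map);
    }

    /* the hot path keeps the indirect call as the fallback: */
    if (static_branch_likely(&iommu_direct_map))
        ret = arm_smmu_map(domain, iova, paddr, size, prot, gfp);
    else
        ret = ops->map(domain, iova, paddr, size, prot, gfp);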

> Also I've been pondering adding direct calls to the iommu dma ops like
> we do for DMA direct.  This would allow us to stop using dma_ops entirely
> for arm64.

Yes, now that we're starting to get things sufficiently consolidated 
behind iommu-dma that might be a reasonable thing to try, although given 
the amount of inherent work further down in the IOVA and IOMMU layers 
that dwarfs that of the direct case, I doubt that reducing the initial 
dispatch overhead would make any noticeable difference in practice.

Robin.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] Optimize iommu_map_sg() performance
  2021-01-12 16:00   ` Robin Murphy
@ 2021-01-21 21:30     ` isaacm
  -1 siblings, 0 replies; 26+ messages in thread
From: isaacm @ 2021-01-21 21:30 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, joro, iommu, pratikp, linux-kernel, linux-arm-kernel, pdaly

On 2021-01-12 08:00, Robin Murphy wrote:
> On 2021-01-11 14:54, Isaac J. Manjarres wrote:
>> The iommu_map_sg() code currently iterates through the given
>> scatter-gather list, and in the worst case, invokes iommu_map()
>> for each element in the scatter-gather list, which calls into
>> the IOMMU driver through an indirect call. For an IOMMU driver
>> that uses a format supported by the io-pgtable code, the IOMMU
>> driver will then call into the io-pgtable code to map the chunk.
>> 
>> Jumping between the IOMMU core code, the IOMMU driver, and the
>> io-pgtable code and back for each element in a scatter-gather list
>> is not efficient.
>> 
>> Instead, add a map_sg() hook in both the IOMMU driver ops and the
>> io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
>> map_sg() hook with the entire scatter-gather list, which can call
>> into the io-pgtable map_sg() hook, which can process the entire
>> scatter-gather list, signficantly reducing the number of indirect
>> calls, and jumps between these layers, boosting performance.
> 
> Out of curiosity, how much of the difference is attributable to actual
> indirect call overhead vs. the additional massive reduction in visits
> to arm_smmu_rpm_{get,put} that you fail to mention? There are ways to
I did an experiment where I compared the two approaches without any calls
to arm_smmu_rpm_[get/put]. There's still a large amount of difference
without the overhead incurred by power management calls. Here are the
results:

no optimizations and no power management calls:
  size        iommu_map_sg
       4K            0.609 us
      64K            8.583 us
       1M          136.083 us
       2M          273.145 us
      12M         1442.119 us
      24M         2876.078 us
      32M         3832.041 us

iommu_map_sg optimizations and no power management calls:
  size        iommu_map_sg
       4K            0.645 us
      64K            1.229 us
       1M            9.531 us
       2M           23.198 us
      12M           99.250 us
      24M          185.713 us
      32M          248.781 us

From here, we can see that the amount of latency incurred by the indirect
calls is fairly large.
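
For reference, both sets of numbers come from a simple in-kernel harness
along these lines (simplified here, not the exact test code), with the
scatter-gather table built entirely from individual 4 KB pages to keep
it a worst case:

    ktime_t start, total = 0;
    int i;

    for (i = 0; i < 10; i++) {
        start = ktime_get();
        iommu_map_sg(domain, iova, sgt->sgl, sgt->nents,
                     IOMMU_READ | IOMMU_WRITE);
        total = ktime_add(total, ktime_sub(ktime_get(), start));
        iommu_unmap(domain, iova, size);
    }
    /* reported latency is the average over the 10 runs */
    pr_info("iommu_map_sg: %lld ns\n", ktime_to_ns(total) / 10);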

> optimise indirect calling that would benefit *all* cases, rather than
> just one operation for one particular driver.
Do you mind sharing some more information on how to optimize the existing
approach further, such that it benefits other drivers as well?
> 
>> On a system that uses the ARM SMMU driver, and the ARM LPAE format,
>> the current implementation of iommu_map_sg() yields the following
>> latencies for mapping scatter-gather lists of various sizes. These
>> latencies are calculated by repeating the mapping operation 10 times:
>> 
>>      size        iommu_map_sg latency
>>        4K            0.624 us
>>       64K            9.468 us
>>        1M          122.557 us
>>        2M          239.807 us
>>       12M         1435.979 us
>>       24M         2884.968 us
>>       32M         3832.979 us
>> 
>> On the same system, the proposed modifications yield the following
>> results:
>> 
>>      size        iommu_map_sg latency
>>        4K            3.645 us
>>       64K            4.198 us
>>        1M           11.010 us
>>        2M           17.125 us
>>       12M           82.416 us
>>       24M          158.677 us
>>       32M          210.468 us
>> 
>> The procedure for collecting the iommu_map_sg latencies is
>> the same in both experiments. Clearly, reducing the jumps
>> between the different layers in the IOMMU code offers a
>> signficant performance boost in iommu_map_sg() latency.
> 
> Presumably those are deliberately worst-case numbers? After all, a
> 32MB scatterlist *could* incur less overhead than a 64KB one if things
> line up just right (still 16 ->map calls, but each with one fewer
Yes, these are worst case numbers (i.e. a buffer is composed entirely
of 4 KB pages, so higher order mappings don't get used).
> level of pagetable to traverse). TBH I find the significant regression
> of the 4KB case the most interesting - what's going on there?
That was an error on my part. After fixing my error, I observed that the
time spent mapping the 4 KB buffer is comparable with and without
optimizations, which is expected.
> 
> My main reservation here is that we get an explosion of duplicate
> copies of almost the same code, and it's code that's just non-trivial
> enough to start being bug-prone. And it's all still only for one
> specific operation - your argument about calling through multiple
> layers for each element applies just as much to iommu_map() itself, so
> why aren't we trying to make more fundamental improvements with wider
> benefits? Indeed I can't imagine the existing iommu_map_sg() loop
> really adds significant overhead compared to a single iommu_map() call
> that results in the equivalent set of ->map calls to the driver.
> 
> At a glance, I reckon that simply extending the internal ->map and
> ->unmap interfaces to encode a number of consecutive identical pages
> would already get us a large chunk of the way there; then we'd be in a
> better place to consider options for the io-pgtable interface.
> 
Do you mean physically contiguous pages? If so, that still wouldn't help
the case where a buffer is composed entirely of 4 KB pages, correct?
> Robin.
> 
>> Changes since v1:
>> 
>> -Fixed an off by one error in arm_[lpae/v7s]_map_by_pgsize
>> when checking if the IOVA and physical address ranges being
>> mapped are within the appropriate limits.
>> -Added Sai Prakash Ranjan's "Tested-by" tag.
>> 
>> Thanks,
>> Isaac
>> 
>> Isaac J. Manjarres (5):
>>    iommu/io-pgtable: Introduce map_sg() as a page table op
>>    iommu/io-pgtable-arm: Hook up map_sg()
>>    iommu/io-pgtable-arm-v7s: Hook up map_sg()
>>    iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
>>    iommu/arm-smmu: Hook up map_sg()
>> 
>>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 ++++++++
>>   drivers/iommu/io-pgtable-arm-v7s.c    | 90 
>> +++++++++++++++++++++++++++++++++++
>>   drivers/iommu/io-pgtable-arm.c        | 86 
>> +++++++++++++++++++++++++++++++++
>>   drivers/iommu/iommu.c                 | 25 ++++++++--
>>   include/linux/io-pgtable.h            |  6 +++
>>   include/linux/iommu.h                 | 13 +++++
>>   6 files changed, 234 insertions(+), 5 deletions(-)
>> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] Optimize iommu_map_sg() performance
  2021-01-21 21:30     ` isaacm
  (?)
@ 2021-01-22 13:44       ` Robin Murphy
  -1 siblings, 0 replies; 26+ messages in thread
From: Robin Murphy @ 2021-01-22 13:44 UTC (permalink / raw)
  To: isaacm; +Cc: pdaly, will, linux-kernel, iommu, pratikp, linux-arm-kernel

On 2021-01-21 21:30, isaacm@codeaurora.org wrote:
> On 2021-01-12 08:00, Robin Murphy wrote:
>> On 2021-01-11 14:54, Isaac J. Manjarres wrote:
>>> The iommu_map_sg() code currently iterates through the given
>>> scatter-gather list, and in the worst case, invokes iommu_map()
>>> for each element in the scatter-gather list, which calls into
>>> the IOMMU driver through an indirect call. For an IOMMU driver
>>> that uses a format supported by the io-pgtable code, the IOMMU
>>> driver will then call into the io-pgtable code to map the chunk.
>>>
>>> Jumping between the IOMMU core code, the IOMMU driver, and the
>>> io-pgtable code and back for each element in a scatter-gather list
>>> is not efficient.
>>>
>>> Instead, add a map_sg() hook in both the IOMMU driver ops and the
>>> io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
>>> map_sg() hook with the entire scatter-gather list, which can call
>>> into the io-pgtable map_sg() hook, which can process the entire
>>> scatter-gather list, signficantly reducing the number of indirect
>>> calls, and jumps between these layers, boosting performance.
>>
>> Out of curiosity, how much of the difference is attributable to actual
>> indirect call overhead vs. the additional massive reduction in visits
>> to arm_smmu_rpm_{get,put} that you fail to mention? There are ways to
> I did an experiment where I compared the two approaches without any calls
> to arm_smmu_rpm_[get/put]. There's still a large amount of difference
> without the overhead incurred by power management calls. Here are the 
> results:
> 
> no optimizations and no power management calls:
>   size        iommu_map_sg
>        4K            0.609 us
>       64K            8.583 us
>        1M          136.083 us
>        2M          273.145 us
>       12M         1442.119 us
>       24M         2876.078 us
>       32M         3832.041 us
> 
> iommu_map_sg optimizations and no power management calls:
> size        iommu_map_sg
>        4K            0.645 us
>       64K            1.229 us
>        1M            9.531 us
>        2M           23.198 us
>       12M           99.250 us
>       24M          185.713 us
>       32M          248.781 us
> 
>  From here, we can see that the amount of latency incurred by the indirect
> calls is fairly large.

OK, that's pretty much in line with what I was imagining, just wanted to 
confirm (if you ended up actually changing the power state around each 
page then the caller would likely be doing something very stupid).

I'm guessing the optimised numbers above looking ~20% slower than the 
ones below is just indicative of a high variance between runs, or maybe 
there's some funky cache interaction that really does make the RPM 
checks have effectively negative overhead.

>> optimise indirect calling that would benefit *all* cases, rather than
>> just one operation for one particular driver.
> Do you mind sharing some more information on how to optimize the existing
> approach further, such that it benefits other drivers as well?

This article touches on some of the possible techniques:

https://lwn.net/Articles/774743/
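
The INDIRECT_CALL_*() wrappers that the networking code uses are one
example of the general idea - roughly (illustrative only):

    #include <linux/indirect_call_wrapper.h>

    /* under CONFIG_RETPOLINE this becomes a compare against the
     * expected target plus a direct call, keeping the plain indirect
     * call as the fallback */
    ret = INDIRECT_CALL_1(ops->map, arm_smmu_map,
                          domain, iova, paddr, size, prot, gfp);

although naming a specific driver's entry point in core code would of
course need to be config-gated, as I said before.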

>>> On a system that uses the ARM SMMU driver, and the ARM LPAE format,
>>> the current implementation of iommu_map_sg() yields the following
>>> latencies for mapping scatter-gather lists of various sizes. These
>>> latencies are calculated by repeating the mapping operation 10 times:
>>>
>>>      size        iommu_map_sg latency
>>>        4K            0.624 us
>>>       64K            9.468 us
>>>        1M          122.557 us
>>>        2M          239.807 us
>>>       12M         1435.979 us
>>>       24M         2884.968 us
>>>       32M         3832.979 us
>>>
>>> On the same system, the proposed modifications yield the following
>>> results:
>>>
>>>      size        iommu_map_sg latency
>>>        4K            3.645 us
>>>       64K            4.198 us
>>>        1M           11.010 us
>>>        2M           17.125 us
>>>       12M           82.416 us
>>>       24M          158.677 us
>>>       32M          210.468 us
>>>
>>> The procedure for collecting the iommu_map_sg latencies is
>>> the same in both experiments. Clearly, reducing the jumps
>>> between the different layers in the IOMMU code offers a
>>> signficant performance boost in iommu_map_sg() latency.
>>
>> Presumably those are deliberately worst-case numbers? After all, a
>> 32MB scatterlist *could* incur less overhead than a 64KB one if things
>> line up just right (still 16 ->map calls, but each with one fewer
> Yes, these are worst case numbers (i.e. a buffer is composed entirely
> of 4 KB pages, so higher order mappings don't get used).
>> level of pagetable to traverse). TBH I find the significant regression
>> of the 4KB case the most interesting - what's going on there?
> That was an error on my part. After fixing my error, I observed that the
> time spent mapping the 4 KB buffer is comparable with and without 
> optimizations,
> which is expected.
>>
>> My main reservation here is that we get an explosion of duplicate
>> copies of almost the same code, and it's code that's just non-trivial
>> enough to start being bug-prone. And it's all still only for one
>> specific operation - your argument about calling through multiple
>> layers for each element applies just as much to iommu_map() itself, so
>> why aren't we trying to make more fundamental improvements with wider
>> benefits? Indeed I can't imagine the existing iommu_map_sg() loop
>> really adds significant overhead compared to a single iommu_map() call
>> that results in the equivalent set of ->map calls to the driver.
>>
>> At a glance, I reckon that simply extending the internal ->map and
>> ->unmap interfaces to encode a number of consecutive identical pages
>> would already get us a large chunk of the way there; then we'd be in a
>> better place to consider options for the io-pgtable interface.
>>
> Do you mean physically contiguous pages? If so, that still wouldn't help 
> the
> case where a buffer is composed entirely of 4 KB pages, correct?

Indeed, simply reducing the number of internal calls will be a fairly 
cheap win for most typical cases - both dma_map_page() for more than one 
page, and dma_map_sg() from users like the block layer rather than 
gigantic pathological dma-buf imports - but we still want to work on 
getting the individual call overhead down to a reasonable level as well.

Thanks,
Robin.

>>> Changes since v1:
>>>
>>> -Fixed an off by one error in arm_[lpae/v7s]_map_by_pgsize
>>> when checking if the IOVA and physical address ranges being
>>> mapped are within the appropriate limits.
>>> -Added Sai Prakash Ranjan's "Tested-by" tag.
>>>
>>> Thanks,
>>> Isaac
>>>
>>> Isaac J. Manjarres (5):
>>>    iommu/io-pgtable: Introduce map_sg() as a page table op
>>>    iommu/io-pgtable-arm: Hook up map_sg()
>>>    iommu/io-pgtable-arm-v7s: Hook up map_sg()
>>>    iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
>>>    iommu/arm-smmu: Hook up map_sg()
>>>
>>>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 ++++++++
>>>   drivers/iommu/io-pgtable-arm-v7s.c    | 90 
>>> +++++++++++++++++++++++++++++++++++
>>>   drivers/iommu/io-pgtable-arm.c        | 86 
>>> +++++++++++++++++++++++++++++++++
>>>   drivers/iommu/iommu.c                 | 25 ++++++++--
>>>   include/linux/io-pgtable.h            |  6 +++
>>>   include/linux/iommu.h                 | 13 +++++
>>>   6 files changed, 234 insertions(+), 5 deletions(-)
>>>
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread
