* [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
@ 2021-01-25  2:38 Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 1/3] iommu: Move iotlb_sync_map out from __iommu_map Lu Baolu
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Lu Baolu @ 2021-01-25  2:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Robin Murphy, iommu, Will Deacon

This patch series is for Request-For-Testing purposes only. It aims to
fix the performance regression reported here:

https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/

The first two patches are borrowed from here.

https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/

Please kindly help with verification.

Best regards,
baolu

Lu Baolu (1):
  iommu/vt-d: Add iotlb_sync_map callback

Yong Wu (2):
  iommu: Move iotlb_sync_map out from __iommu_map
  iommu: Add iova and size as parameters in iotlb_sync_map

 drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
 drivers/iommu/iommu.c       | 23 +++++++---
 drivers/iommu/tegra-gart.c  |  7 ++-
 include/linux/iommu.h       |  3 +-
 4 files changed, 83 insertions(+), 36 deletions(-)
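
For reference, the resulting call flow with all three patches applied is
roughly the following (a sketch of the paths touched below, not a
substitute for the patches themselves):

    iommu_map() / iommu_map_atomic()
      -> _iommu_map()
           -> __iommu_map()                              /* no per-chunk flush */
           -> ops->iotlb_sync_map(domain, iova, size)    /* once per mapping */

    iommu_map_sg() / iommu_map_sg_atomic()
      -> __iommu_map_sg()                                /* map all segments first */
           -> ops->iotlb_sync_map(domain, iova, mapped)  /* once per scatterlist */

    VT-d: .iotlb_sync_map = intel_iommu_iotlb_sync_map()
      -> clflush_sync_map()       /* only if !dmar_domain->iommu_coherency */
      -> __mapping_notify_one()   /* for each IOMMU in the domain */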

-- 
2.25.1


* [RFT PATCH 1/3] iommu: Move iotlb_sync_map out from __iommu_map
  2021-01-25  2:38 [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Lu Baolu
@ 2021-01-25  2:38 ` Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 2/3] iommu: Add iova and size as parameters in iotlb_sync_map Lu Baolu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Lu Baolu @ 2021-01-25  2:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Will Deacon, Robin Murphy, iommu

From: Yong Wu <yong.wu@mediatek.com>

At the end of __iommu_map(), iotlb_sync_map is always called.

This patch moves iotlb_sync_map out of __iommu_map, since it is
unnecessary to call it for each sg segment, especially as iotlb_sync_map
currently flushes the entire TLB. Add a small helper, _iommu_map(), for this.

Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iommu.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ffeebda8d6de..c304a6a30d42 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2426,9 +2426,6 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 		size -= pgsize;
 	}
 
-	if (ops->iotlb_sync_map)
-		ops->iotlb_sync_map(domain);
-
 	/* unroll mapping in case something went wrong */
 	if (ret)
 		iommu_unmap(domain, orig_iova, orig_size - size);
@@ -2438,18 +2435,31 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	return ret;
 }
 
+static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
+		      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+	const struct iommu_ops *ops = domain->ops;
+	int ret;
+
+	ret = __iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
+	if (ret == 0 && ops->iotlb_sync_map)
+		ops->iotlb_sync_map(domain);
+
+	return ret;
+}
+
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
 	      phys_addr_t paddr, size_t size, int prot)
 {
 	might_sleep();
-	return __iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
+	return _iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(iommu_map);
 
 int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
 	      phys_addr_t paddr, size_t size, int prot)
 {
-	return __iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
+	return _iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
 }
 EXPORT_SYMBOL_GPL(iommu_map_atomic);
 
@@ -2533,6 +2543,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			     struct scatterlist *sg, unsigned int nents, int prot,
 			     gfp_t gfp)
 {
+	const struct iommu_ops *ops = domain->ops;
 	size_t len = 0, mapped = 0;
 	phys_addr_t start;
 	unsigned int i = 0;
@@ -2563,6 +2574,8 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			sg = sg_next(sg);
 	}
 
+	if (ops->iotlb_sync_map)
+		ops->iotlb_sync_map(domain);
 	return mapped;
 
 out_err:
-- 
2.25.1


* [RFT PATCH 2/3] iommu: Add iova and size as parameters in iotlb_sync_map
  2021-01-25  2:38 [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 1/3] iommu: Move iotlb_sync_map out from __iommu_map Lu Baolu
@ 2021-01-25  2:38 ` Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 3/3] iommu/vt-d: Add iotlb_sync_map callback Lu Baolu
  2021-01-25 17:39 ` [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Chuck Lever
  3 siblings, 0 replies; 12+ messages in thread
From: Lu Baolu @ 2021-01-25  2:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Will Deacon, Robin Murphy, iommu

From: Yong Wu <yong.wu@mediatek.com>

iotlb_sync_map allows IOMMU drivers to sync the TLB after the whole
mapping is complete. This patch adds iova and size as parameters to it,
so that the IOMMU driver can flush the TLB for the whole range at once
after the iova mapping, improving performance.

Signed-off-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iommu.c      | 4 ++--
 drivers/iommu/tegra-gart.c | 7 +++++--
 include/linux/iommu.h      | 3 ++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c304a6a30d42..3d099a31ddca 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2443,7 +2443,7 @@ static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
 
 	ret = __iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
 	if (ret == 0 && ops->iotlb_sync_map)
-		ops->iotlb_sync_map(domain);
+		ops->iotlb_sync_map(domain, iova, size);
 
 	return ret;
 }
@@ -2575,7 +2575,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 	}
 
 	if (ops->iotlb_sync_map)
-		ops->iotlb_sync_map(domain);
+		ops->iotlb_sync_map(domain, iova, mapped);
 	return mapped;
 
 out_err:
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index fac720273889..05e8e19b8269 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -261,7 +261,8 @@ static int gart_iommu_of_xlate(struct device *dev,
 	return 0;
 }
 
-static void gart_iommu_sync_map(struct iommu_domain *domain)
+static void gart_iommu_sync_map(struct iommu_domain *domain, unsigned long iova,
+				size_t size)
 {
 	FLUSH_GART_REGS(gart_handle);
 }
@@ -269,7 +270,9 @@ static void gart_iommu_sync_map(struct iommu_domain *domain)
 static void gart_iommu_sync(struct iommu_domain *domain,
 			    struct iommu_iotlb_gather *gather)
 {
-	gart_iommu_sync_map(domain);
+	size_t length = gather->end - gather->start;
+
+	gart_iommu_sync_map(domain, gather->start, length);
 }
 
 static const struct iommu_ops gart_iommu_ops = {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b3f0e2018c62..9ce0aa9e236b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -246,7 +246,8 @@ struct iommu_ops {
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
 		     size_t size, struct iommu_iotlb_gather *iotlb_gather);
 	void (*flush_iotlb_all)(struct iommu_domain *domain);
-	void (*iotlb_sync_map)(struct iommu_domain *domain);
+	void (*iotlb_sync_map)(struct iommu_domain *domain, unsigned long iova,
+			       size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain,
 			   struct iommu_iotlb_gather *iotlb_gather);
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
-- 
2.25.1


* [RFT PATCH 3/3] iommu/vt-d: Add iotlb_sync_map callback
  2021-01-25  2:38 [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 1/3] iommu: Move iotlb_sync_map out from __iommu_map Lu Baolu
  2021-01-25  2:38 ` [RFT PATCH 2/3] iommu: Add iova and size as parameters in iotlb_sync_map Lu Baolu
@ 2021-01-25  2:38 ` Lu Baolu
  2021-01-25 17:39 ` [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Chuck Lever
  3 siblings, 0 replies; 12+ messages in thread
From: Lu Baolu @ 2021-01-25  2:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Robin Murphy, iommu, Will Deacon

Some Intel VT-d hardware implementations don't support memory coherency
for page table walks (indicated by the Page-Walk-coherency bit in the
ecap register), so software must flush the corresponding CPU cache
lines explicitly after each page table entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element, which calls into the vendor
IOMMU driver through the iommu_ops callbacks. As a result, a single sg
mapping may lead to multiple cache line flushes, which degrades I/O
performance after commit c588072bba6b5 ("iommu/vt-d: Convert intel iommu
driver to the iommu ops").

Fix this by adding an iotlb_sync_map callback and centralizing the
clflush operations after all sg mappings are complete.

Fixes: c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f665322a0991..f5a236e63ded 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2383,34 +2383,14 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 		 * recalculate 'pte' and switch back to smaller pages for the
 		 * end of the mapping, if the trailing size is not enough to
 		 * use another superpage (i.e. nr_pages < lvl_pages).
+		 *
+		 * We leave clflush for the leaf pte changes to iotlb_sync_map()
+		 * callback.
 		 */
 		pte++;
 		if (!nr_pages || first_pte_in_page(pte) ||
-		    (largepage_lvl > 1 && nr_pages < lvl_pages)) {
-			domain_flush_cache(domain, first_pte,
-					   (void *)pte - (void *)first_pte);
+		    (largepage_lvl > 1 && nr_pages < lvl_pages))
 			pte = NULL;
-		}
-	}
-
-	return 0;
-}
-
-static int
-domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
-	       unsigned long phys_pfn, unsigned long nr_pages, int prot)
-{
-	int iommu_id, ret;
-	struct intel_iommu *iommu;
-
-	/* Do the real mapping first */
-	ret = __domain_mapping(domain, iov_pfn, phys_pfn, nr_pages, prot);
-	if (ret)
-		return ret;
-
-	for_each_domain_iommu(iommu_id, domain) {
-		iommu = g_iommus[iommu_id];
-		__mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
 	}
 
 	return 0;
@@ -4943,7 +4923,6 @@ static int intel_iommu_map(struct iommu_domain *domain,
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	u64 max_addr;
 	int prot = 0;
-	int ret;
 
 	if (iommu_prot & IOMMU_READ)
 		prot |= DMA_PTE_READ;
@@ -4969,9 +4948,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
 	/* Round up size to next multiple of PAGE_SIZE, if it and
 	   the low bits of hpa would take us onto the next page */
 	size = aligned_nrpages(hpa, size);
-	ret = domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
-			     hpa >> VTD_PAGE_SHIFT, size, prot);
-	return ret;
+	return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
+				hpa >> VTD_PAGE_SHIFT, size, prot);
 }
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
@@ -5478,6 +5456,57 @@ static bool risky_device(struct pci_dev *pdev)
 	return false;
 }
 
+static void clflush_sync_map(struct dmar_domain *domain, unsigned long clf_pfn,
+			     unsigned long clf_pages)
+{
+	struct dma_pte *first_pte = NULL, *pte = NULL;
+	unsigned long lvl_pages = 0;
+	int level = 0;
+
+	while (clf_pages > 0) {
+		if (!pte) {
+			level = 0;
+			pte = pfn_to_dma_pte(domain, clf_pfn, &level);
+			if (WARN_ON(!pte))
+				return;
+			first_pte = pte;
+			lvl_pages = lvl_to_nr_pages(level);
+		}
+
+		if (WARN_ON(!lvl_pages || clf_pages < lvl_pages))
+			return;
+
+		clf_pages -= lvl_pages;
+		clf_pfn += lvl_pages;
+		pte++;
+
+		if (!clf_pages || first_pte_in_page(pte) ||
+		    (level > 1 && clf_pages < lvl_pages)) {
+			domain_flush_cache(domain, first_pte,
+					   (void *)pte - (void *)first_pte);
+			pte = NULL;
+		}
+	}
+}
+
+static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
+				       unsigned long iova, size_t size)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long pages = aligned_nrpages(iova, size);
+	unsigned long pfn = iova >> VTD_PAGE_SHIFT;
+	struct intel_iommu *iommu;
+	int iommu_id;
+
+	if (!dmar_domain->iommu_coherency)
+		clflush_sync_map(dmar_domain, pfn, pages);
+
+	for_each_domain_iommu(iommu_id, dmar_domain) {
+		iommu = g_iommus[iommu_id];
+		__mapping_notify_one(iommu, dmar_domain, pfn, pages);
+	}
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
@@ -5490,6 +5519,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.aux_detach_dev		= intel_iommu_aux_detach_device,
 	.aux_get_pasid		= intel_iommu_aux_get_pasid,
 	.map			= intel_iommu_map,
+	.iotlb_sync_map		= intel_iommu_iotlb_sync_map,
 	.unmap			= intel_iommu_unmap,
 	.flush_iotlb_all        = intel_flush_iotlb_all,
 	.iotlb_sync		= intel_iommu_tlb_sync,
-- 
2.25.1


* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-25  2:38 [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Lu Baolu
                   ` (2 preceding siblings ...)
  2021-01-25  2:38 ` [RFT PATCH 3/3] iommu/vt-d: Add iotlb_sync_map callback Lu Baolu
@ 2021-01-25 17:39 ` Chuck Lever
  2021-01-25 19:31   ` Chuck Lever
  3 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2021-01-25 17:39 UTC (permalink / raw)
  To: Lu Baolu; +Cc: Will Deacon, Robin Murphy, iommu

Hello Lu -

Many thanks for your prototype.


> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> This patch series is only for Request-For-Testing purpose. It aims to
> fix the performance regression reported here.
> 
> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
> 
> The first two patches are borrowed from here.
> 
> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
> 
> Please kindly help to verification.
> 
> Best regards,
> baolu
> 
> Lu Baolu (1):
>  iommu/vt-d: Add iotlb_sync_map callback
> 
> Yong Wu (2):
>  iommu: Move iotlb_sync_map out from __iommu_map
>  iommu: Add iova and size as parameters in iotlb_sync_map
> 
> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
> drivers/iommu/iommu.c       | 23 +++++++---
> drivers/iommu/tegra-gart.c  |  7 ++-
> include/linux/iommu.h       |  3 +-
> 4 files changed, 83 insertions(+), 36 deletions(-)

Here are results with the NFS client at stock v5.11-rc5 and the
NFS server at v5.10, showing the regression I reported earlier.

	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
	Min throughput per process                      = 373101.59 kB/sec 
	Max throughput per process                      = 382669.50 kB/sec
	Avg throughput per process                      = 377881.83 kB/sec
	Min xfer                                        = 1022720.00 kB
	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %


	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
	Min throughput per process                      = 374672.00 kB/sec 
	Max throughput per process                      = 383983.78 kB/sec
	Avg throughput per process                      = 378500.26 kB/sec
	Min xfer                                        = 1022976.00 kB
	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %


	Children see throughput for 12 readers          = 4568632.03 kB/sec
	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
	Min throughput per process                      = 376727.56 kB/sec 
	Max throughput per process                      = 383783.91 kB/sec
	Avg throughput per process                      = 380719.34 kB/sec
	Min xfer                                        = 1029376.00 kB
	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %


	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
	Min throughput per process                      = 381532.78 kB/sec 
	Max throughput per process                      = 387072.53 kB/sec
	Avg throughput per process                      = 384225.23 kB/sec
	Min xfer                                        = 1034496.00 kB
	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %

Here's the NFS client at v5.11-rc5 with your series applied.
The NFS server remains at v5.10:

	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
	Min throughput per process                      = 367865.28 kB/sec 
	Max throughput per process                      = 371134.38 kB/sec
	Avg throughput per process                      = 369564.90 kB/sec
	Min xfer                                        = 1039360.00 kB
	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %


	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
	Min throughput per process                      = 370985.34 kB/sec 
	Max throughput per process                      = 374752.28 kB/sec
	Avg throughput per process                      = 373072.56 kB/sec
	Min xfer                                        = 1038592.00 kB
	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %


	Children see throughput for 12 readers          = 5865268.88 kB/sec
	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
	Min throughput per process                      = 487766.81 kB/sec 
	Max throughput per process                      = 489623.88 kB/sec
	Avg throughput per process                      = 488772.41 kB/sec
	Min xfer                                        = 1044736.00 kB
	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %


	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
	Min throughput per process                      = 485835.03 kB/sec 
	Max throughput per process                      = 488702.12 kB/sec
	Avg throughput per process                      = 487286.55 kB/sec
	Min xfer                                        = 1042688.00 kB
	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %

NFS READ throughput is almost fully restored. A normal-looking throughput
result, copied from the previous thread, is:

	Children see throughput for 12 readers 		= 5921370.94 kB/sec
	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec

The NFS WRITE throughput result appears to be unchanged, or slightly
worse than before. I don't have an explanation for this result. I also
applied your patches on the NFS server, without noting any improvement.
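
For rough scale, the same numbers expressed as ratios:

	READ:   5865268.88 / 5921370.94 ~= 0.99  (about 99% of the normal result)
	        5865268.88 / 4568632.03 ~= 1.28  (about 28% above stock v5.11-rc5)
	WRITE:  4434778.81 / 4534582.00 ~= 0.98  (about 2% below stock v5.11-rc5)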


--
Chuck Lever




* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-25 17:39 ` [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Chuck Lever
@ 2021-01-25 19:31   ` Chuck Lever
  2021-01-26  6:18     ` Lu Baolu
  2021-01-26 16:05     ` Robin Murphy
  0 siblings, 2 replies; 12+ messages in thread
From: Chuck Lever @ 2021-01-25 19:31 UTC (permalink / raw)
  To: Lu Baolu; +Cc: Will Deacon, Robin Murphy, iommu



> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> Hello Lu -
> 
> Many thanks for your prototype.
> 
> 
>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>> 
>> This patch series is only for Request-For-Testing purpose. It aims to
>> fix the performance regression reported here.
>> 
>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>> 
>> The first two patches are borrowed from here.
>> 
>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>> 
>> Please kindly help to verification.
>> 
>> Best regards,
>> baolu
>> 
>> Lu Baolu (1):
>> iommu/vt-d: Add iotlb_sync_map callback
>> 
>> Yong Wu (2):
>> iommu: Move iotlb_sync_map out from __iommu_map
>> iommu: Add iova and size as parameters in iotlb_sync_map
>> 
>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>> drivers/iommu/iommu.c       | 23 +++++++---
>> drivers/iommu/tegra-gart.c  |  7 ++-
>> include/linux/iommu.h       |  3 +-
>> 4 files changed, 83 insertions(+), 36 deletions(-)
> 
> Here are results with the NFS client at stock v5.11-rc5 and the
> NFS server at v5.10, showing the regression I reported earlier.
> 
> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
> 	Min throughput per process                      = 373101.59 kB/sec 
> 	Max throughput per process                      = 382669.50 kB/sec
> 	Avg throughput per process                      = 377881.83 kB/sec
> 	Min xfer                                        = 1022720.00 kB
> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
> 
> 
> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
> 	Min throughput per process                      = 374672.00 kB/sec 
> 	Max throughput per process                      = 383983.78 kB/sec
> 	Avg throughput per process                      = 378500.26 kB/sec
> 	Min xfer                                        = 1022976.00 kB
> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
> 
> 
> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
> 	Min throughput per process                      = 376727.56 kB/sec 
> 	Max throughput per process                      = 383783.91 kB/sec
> 	Avg throughput per process                      = 380719.34 kB/sec
> 	Min xfer                                        = 1029376.00 kB
> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
> 
> 
> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
> 	Min throughput per process                      = 381532.78 kB/sec 
> 	Max throughput per process                      = 387072.53 kB/sec
> 	Avg throughput per process                      = 384225.23 kB/sec
> 	Min xfer                                        = 1034496.00 kB
> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
> 
> Here's the NFS client at v5.11-rc5 with your series applied.
> The NFS server remains at v5.10:
> 
> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
> 	Min throughput per process                      = 367865.28 kB/sec 
> 	Max throughput per process                      = 371134.38 kB/sec
> 	Avg throughput per process                      = 369564.90 kB/sec
> 	Min xfer                                        = 1039360.00 kB
> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
> 
> 
> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
> 	Min throughput per process                      = 370985.34 kB/sec 
> 	Max throughput per process                      = 374752.28 kB/sec
> 	Avg throughput per process                      = 373072.56 kB/sec
> 	Min xfer                                        = 1038592.00 kB
> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
> 
> 
> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
> 	Min throughput per process                      = 487766.81 kB/sec 
> 	Max throughput per process                      = 489623.88 kB/sec
> 	Avg throughput per process                      = 488772.41 kB/sec
> 	Min xfer                                        = 1044736.00 kB
> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
> 
> 
> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
> 	Min throughput per process                      = 485835.03 kB/sec 
> 	Max throughput per process                      = 488702.12 kB/sec
> 	Avg throughput per process                      = 487286.55 kB/sec
> 	Min xfer                                        = 1042688.00 kB
> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
> 
> NFS READ throughput is almost fully restored. A normal-looking throughput
> result, copied from the previous thread, is:
> 
> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
> 
> The NFS WRITE throughput result appears to be unchanged, or slightly
> worse than before. I don't have an explanation for this result. I applied
> your patches on the NFS server also without noting improvement.

Function-boundary tracing shows some interesting results.

# trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs

Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
the same size need as many as one __iommu_map() call per SGL element (which
would be 30 for a 120KB SGL).

In v5.10, intel_map_sg() was structured such that an SGL is always handled with
a single call to domain_mapping() and thus always just a single TLB flush.

My amateur theorizing suggests that the SGL element coalescing done in
__iommu_map_sg() is not working as well as intel_map_sg() used to, which results
in more calls to domain_mapping(). Not only does that take longer, but it creates
many more DMA maps. Could that also have some impact on device TLB resources?
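
As a stand-alone model of the coalescing decision in question (a sketch
only, not the kernel code), the following treats a run of SGL elements as
mergeable into a single map call only while each element is physically
contiguous with the previous one, so a 120KB SGL backed by 30 discontiguous
pages costs 30 map calls while a physically contiguous one costs 1:

#include <stdio.h>
#include <stddef.h>

struct seg { unsigned long long phys; size_t len; };

/* Count how many map calls a scatterlist of n elements would need. */
static unsigned int count_map_calls(const struct seg *sg, unsigned int n)
{
        unsigned long long start = 0;
        size_t run = 0;
        unsigned int i, calls = 0;

        for (i = 0; i < n; i++) {
                /* a physical gap ends the current run and costs one map call */
                if (run && sg[i].phys != start + run) {
                        calls++;
                        run = 0;
                }
                if (!run)
                        start = sg[i].phys;
                run += sg[i].len;
        }
        return run ? calls + 1 : calls; /* final run */
}

int main(void)
{
        enum { NPAGES = 30 };   /* 30 x 4KB = 120KB, as in the traces */
        struct seg contig[NPAGES], sparse[NPAGES];
        unsigned int i;

        for (i = 0; i < NPAGES; i++) {
                contig[i].phys = 0x100000000ULL + i * 4096ULL;
                contig[i].len  = 4096;
                sparse[i].phys = 0x100000000ULL + i * 8192ULL;  /* hole after each page */
                sparse[i].len  = 4096;
        }

        printf("contiguous 120KB SGL: %u map call(s)\n",
               count_map_calls(contig, NPAGES));
        printf("scattered  120KB SGL: %u map call(s)\n",
               count_map_calls(sparse, NPAGES));
        return 0;
}

Built with a plain C compiler this prints 1 map call for the contiguous
case and 30 for the scattered one, matching the two behaviours seen in
the trace.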


--
Chuck Lever




* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-25 19:31   ` Chuck Lever
@ 2021-01-26  6:18     ` Lu Baolu
  2021-01-26 15:52       ` Chuck Lever
  2021-01-26 16:05     ` Robin Murphy
  1 sibling, 1 reply; 12+ messages in thread
From: Lu Baolu @ 2021-01-26  6:18 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Robin Murphy, iommu, Will Deacon

On 2021/1/26 3:31, Chuck Lever wrote:
> 
> 
>> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> Hello Lu -
>>
>> Many thanks for your prototype.
>>
>>
>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>
>>> This patch series is only for Request-For-Testing purpose. It aims to
>>> fix the performance regression reported here.
>>>
>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>
>>> The first two patches are borrowed from here.
>>>
>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>
>>> Please kindly help to verification.
>>>
>>> Best regards,
>>> baolu
>>>
>>> Lu Baolu (1):
>>> iommu/vt-d: Add iotlb_sync_map callback
>>>
>>> Yong Wu (2):
>>> iommu: Move iotlb_sync_map out from __iommu_map
>>> iommu: Add iova and size as parameters in iotlb_sync_map
>>>
>>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>> drivers/iommu/iommu.c       | 23 +++++++---
>>> drivers/iommu/tegra-gart.c  |  7 ++-
>>> include/linux/iommu.h       |  3 +-
>>> 4 files changed, 83 insertions(+), 36 deletions(-)
>>
>> Here are results with the NFS client at stock v5.11-rc5 and the
>> NFS server at v5.10, showing the regression I reported earlier.
>>
>> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
>> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
>> 	Min throughput per process                      = 373101.59 kB/sec
>> 	Max throughput per process                      = 382669.50 kB/sec
>> 	Avg throughput per process                      = 377881.83 kB/sec
>> 	Min xfer                                        = 1022720.00 kB
>> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
>>
>>
>> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
>> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
>> 	Min throughput per process                      = 374672.00 kB/sec
>> 	Max throughput per process                      = 383983.78 kB/sec
>> 	Avg throughput per process                      = 378500.26 kB/sec
>> 	Min xfer                                        = 1022976.00 kB
>> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
>>
>>
>> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
>> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
>> 	Min throughput per process                      = 376727.56 kB/sec
>> 	Max throughput per process                      = 383783.91 kB/sec
>> 	Avg throughput per process                      = 380719.34 kB/sec
>> 	Min xfer                                        = 1029376.00 kB
>> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
>>
>>
>> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
>> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
>> 	Min throughput per process                      = 381532.78 kB/sec
>> 	Max throughput per process                      = 387072.53 kB/sec
>> 	Avg throughput per process                      = 384225.23 kB/sec
>> 	Min xfer                                        = 1034496.00 kB
>> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
>>
>> Here's the NFS client at v5.11-rc5 with your series applied.
>> The NFS server remains at v5.10:
>>
>> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
>> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
>> 	Min throughput per process                      = 367865.28 kB/sec
>> 	Max throughput per process                      = 371134.38 kB/sec
>> 	Avg throughput per process                      = 369564.90 kB/sec
>> 	Min xfer                                        = 1039360.00 kB
>> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
>>
>>
>> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
>> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
>> 	Min throughput per process                      = 370985.34 kB/sec
>> 	Max throughput per process                      = 374752.28 kB/sec
>> 	Avg throughput per process                      = 373072.56 kB/sec
>> 	Min xfer                                        = 1038592.00 kB
>> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
>>
>>
>> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
>> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
>> 	Min throughput per process                      = 487766.81 kB/sec
>> 	Max throughput per process                      = 489623.88 kB/sec
>> 	Avg throughput per process                      = 488772.41 kB/sec
>> 	Min xfer                                        = 1044736.00 kB
>> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
>>
>>
>> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
>> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
>> 	Min throughput per process                      = 485835.03 kB/sec
>> 	Max throughput per process                      = 488702.12 kB/sec
>> 	Avg throughput per process                      = 487286.55 kB/sec
>> 	Min xfer                                        = 1042688.00 kB
>> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
>>
>> NFS READ throughput is almost fully restored. A normal-looking throughput
>> result, copied from the previous thread, is:
>>
>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>
>> The NFS WRITE throughput result appears to be unchanged, or slightly
>> worse than before. I don't have an explanation for this result. I applied
>> your patches on the NFS server also without noting improvement.
> 
> Function-boundary tracing shows some interesting results.
> 
> # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
> 
> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
> the same size need as many as one __iommu_map() call per SGL element (which
> would be 30 for a 120KB SGL).
> 
> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
> a single call to domain_mapping() and thus always just a single TLB flush.
> 
> My amateur theorizing suggests that the SGL element coalescing done in
> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
> in more calls to domain_mapping(). Not only does that take longer, but it creates
> many more DMA maps. Could that also have some impact on device TLB resources?

It seems that the additional domain_mapping() calls are caused not by
__iommu_map_sg() but by __iommu_map().

Can you please test the changes below? They call intel_iommu_map()
directly instead of __iommu_map().

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f5a236e63ded..660d5744a117 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4916,7 +4916,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
  }
  #endif

-static int intel_iommu_map(struct iommu_domain *domain,
+int intel_iommu_map(struct iommu_domain *domain,
                            unsigned long iova, phys_addr_t hpa,
                            size_t size, int iommu_prot, gfp_t gfp)
  {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3d099a31ddca..a1b41fd3fb4e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -23,8 +23,13 @@
  #include <linux/property.h>
  #include <linux/fsl/mc.h>
  #include <linux/module.h>
+#include <linux/intel-iommu.h>
  #include <trace/events/iommu.h>

+extern int intel_iommu_map(struct iommu_domain *domain,
+                          unsigned long iova, phys_addr_t hpa,
+                          size_t size, int iommu_prot, gfp_t gfp);
+
  static struct kset *iommu_group_kset;
  static DEFINE_IDA(iommu_group_ida);

@@ -2553,8 +2558,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
                 phys_addr_t s_phys = sg_phys(sg);

                 if (len && s_phys != start + len) {
-                       ret = __iommu_map(domain, iova + mapped, start,
-                                       len, prot, gfp);
+                       ret = intel_iommu_map(domain, iova + mapped, start, len, prot, gfp);

                         if (ret)
                                 goto out_err;

Does it change anything?

Best regards,
baolu

* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-26  6:18     ` Lu Baolu
@ 2021-01-26 15:52       ` Chuck Lever
  2021-01-27  1:53         ` Lu Baolu
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2021-01-26 15:52 UTC (permalink / raw)
  To: Lu Baolu; +Cc: Will Deacon, Robin Murphy, iommu



> On Jan 26, 2021, at 1:18 AM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> On 2021/1/26 3:31, Chuck Lever wrote:
>>> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>> Hello Lu -
>>> 
>>> Many thanks for your prototype.
>>> 
>>> 
>>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>> 
>>>> This patch series is only for Request-For-Testing purpose. It aims to
>>>> fix the performance regression reported here.
>>>> 
>>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>> 
>>>> The first two patches are borrowed from here.
>>>> 
>>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>> 
>>>> Please kindly help to verification.
>>>> 
>>>> Best regards,
>>>> baolu
>>>> 
>>>> Lu Baolu (1):
>>>> iommu/vt-d: Add iotlb_sync_map callback
>>>> 
>>>> Yong Wu (2):
>>>> iommu: Move iotlb_sync_map out from __iommu_map
>>>> iommu: Add iova and size as parameters in iotlb_sync_map
>>>> 
>>>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>>> drivers/iommu/iommu.c       | 23 +++++++---
>>>> drivers/iommu/tegra-gart.c  |  7 ++-
>>>> include/linux/iommu.h       |  3 +-
>>>> 4 files changed, 83 insertions(+), 36 deletions(-)
>>> 
>>> Here are results with the NFS client at stock v5.11-rc5 and the
>>> NFS server at v5.10, showing the regression I reported earlier.
>>> 
>>> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
>>> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
>>> 	Min throughput per process                      = 373101.59 kB/sec
>>> 	Max throughput per process                      = 382669.50 kB/sec
>>> 	Avg throughput per process                      = 377881.83 kB/sec
>>> 	Min xfer                                        = 1022720.00 kB
>>> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
>>> 
>>> 
>>> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
>>> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
>>> 	Min throughput per process                      = 374672.00 kB/sec
>>> 	Max throughput per process                      = 383983.78 kB/sec
>>> 	Avg throughput per process                      = 378500.26 kB/sec
>>> 	Min xfer                                        = 1022976.00 kB
>>> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
>>> 
>>> 
>>> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
>>> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
>>> 	Min throughput per process                      = 376727.56 kB/sec
>>> 	Max throughput per process                      = 383783.91 kB/sec
>>> 	Avg throughput per process                      = 380719.34 kB/sec
>>> 	Min xfer                                        = 1029376.00 kB
>>> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
>>> 
>>> 
>>> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
>>> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
>>> 	Min throughput per process                      = 381532.78 kB/sec
>>> 	Max throughput per process                      = 387072.53 kB/sec
>>> 	Avg throughput per process                      = 384225.23 kB/sec
>>> 	Min xfer                                        = 1034496.00 kB
>>> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
>>> 
>>> Here's the NFS client at v5.11-rc5 with your series applied.
>>> The NFS server remains at v5.10:
>>> 
>>> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
>>> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
>>> 	Min throughput per process                      = 367865.28 kB/sec
>>> 	Max throughput per process                      = 371134.38 kB/sec
>>> 	Avg throughput per process                      = 369564.90 kB/sec
>>> 	Min xfer                                        = 1039360.00 kB
>>> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
>>> 
>>> 
>>> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
>>> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
>>> 	Min throughput per process                      = 370985.34 kB/sec
>>> 	Max throughput per process                      = 374752.28 kB/sec
>>> 	Avg throughput per process                      = 373072.56 kB/sec
>>> 	Min xfer                                        = 1038592.00 kB
>>> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
>>> 
>>> 
>>> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
>>> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
>>> 	Min throughput per process                      = 487766.81 kB/sec
>>> 	Max throughput per process                      = 489623.88 kB/sec
>>> 	Avg throughput per process                      = 488772.41 kB/sec
>>> 	Min xfer                                        = 1044736.00 kB
>>> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
>>> 
>>> 
>>> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
>>> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
>>> 	Min throughput per process                      = 485835.03 kB/sec
>>> 	Max throughput per process                      = 488702.12 kB/sec
>>> 	Avg throughput per process                      = 487286.55 kB/sec
>>> 	Min xfer                                        = 1042688.00 kB
>>> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
>>> 
>>> NFS READ throughput is almost fully restored. A normal-looking throughput
>>> result, copied from the previous thread, is:
>>> 
>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>> 
>>> The NFS WRITE throughput result appears to be unchanged, or slightly
>>> worse than before. I don't have an explanation for this result. I applied
>>> your patches on the NFS server also without noting improvement.
>> Function-boundary tracing shows some interesting results.
>> # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
>> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
>> the same size need as many as one __iommu_map() call per SGL element (which
>> would be 30 for a 120KB SGL).
>> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
>> a single call to domain_mapping() and thus always just a single TLB flush.
>> My amateur theorizing suggests that the SGL element coalescing done in
>> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
>> in more calls to domain_mapping(). Not only does that take longer, but it creates
>> many more DMA maps. Could that also have some impact on device TLB resources?
> 
> It seems that more domain_mapping() calls are not caused by
> __iommu_map_sg() but __iommu_map().
> 
> Can you please test below changes? It call intel_iommu_map() directly
> instead of __iommu_map().
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index f5a236e63ded..660d5744a117 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4916,7 +4916,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
> }
> #endif
> 
> -static int intel_iommu_map(struct iommu_domain *domain,
> +int intel_iommu_map(struct iommu_domain *domain,
>                           unsigned long iova, phys_addr_t hpa,
>                           size_t size, int iommu_prot, gfp_t gfp)
> {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3d099a31ddca..a1b41fd3fb4e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -23,8 +23,13 @@
> #include <linux/property.h>
> #include <linux/fsl/mc.h>
> #include <linux/module.h>
> +#include <linux/intel-iommu.h>
> #include <trace/events/iommu.h>
> 
> +extern int intel_iommu_map(struct iommu_domain *domain,
> +                          unsigned long iova, phys_addr_t hpa,
> +                          size_t size, int iommu_prot, gfp_t gfp);
> +
> static struct kset *iommu_group_kset;
> static DEFINE_IDA(iommu_group_ida);
> 
> @@ -2553,8 +2558,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
>                phys_addr_t s_phys = sg_phys(sg);
> 
>                if (len && s_phys != start + len) {
> -                       ret = __iommu_map(domain, iova + mapped, start,
> -                                       len, prot, gfp);
> +                       ret = intel_iommu_map(domain, iova + mapped, start, len, prot, gfp);
> 
>                        if (ret)
>                                goto out_err;
> 
> Does it change anything?

I removed yesterday's 3-patch series, and applied the above.
Not a full restoration, but interesting nonetheless.

	Children see throughput for 12 initial writers 	= 4852657.22 kB/sec
	Parent sees throughput for 12 initial writers 	= 4826730.38 kB/sec
	Min throughput per process 			=  403196.09 kB/sec 
	Max throughput per process 			=  406071.47 kB/sec
	Avg throughput per process 			=  404388.10 kB/sec
	Min xfer 					= 1041408.00 kB
	CPU Utilization: Wall time    2.596    CPU time    2.047    CPU utilization  78.84 %


	Children see throughput for 12 rewriters 	= 4853900.22 kB/sec
	Parent sees throughput for 12 rewriters 	= 4848623.31 kB/sec
	Min throughput per process 			=  403380.81 kB/sec 
	Max throughput per process 			=  405478.53 kB/sec
	Avg throughput per process 			=  404491.68 kB/sec
	Min xfer 					= 1042944.00 kB
	CPU utilization: Wall time    2.589    CPU time    2.048    CPU utilization  79.12 %


	Children see throughput for 12 readers 		= 4938124.12 kB/sec
	Parent sees throughput for 12 readers 		= 4932862.08 kB/sec
	Min throughput per process 			=  408768.81 kB/sec 
	Max throughput per process 			=  413879.25 kB/sec
	Avg throughput per process 			=  411510.34 kB/sec
	Min xfer 					= 1036800.00 kB
	CPU utilization: Wall time    2.536    CPU time    1.906    CPU utilization  75.16 %


	Children see throughput for 12 re-readers 	= 4992115.16 kB/sec
	Parent sees throughput for 12 re-readers 	= 4986102.07 kB/sec
	Min throughput per process 			=  411103.00 kB/sec 
	Max throughput per process 			=  420302.97 kB/sec
	Avg throughput per process 			=  416009.60 kB/sec
	Min xfer 					= 1025792.00 kB
	CPU utilization: Wall time    2.497    CPU time    1.887    CPU utilization  75.57 %

NFS WRITE throughput improves significantly. NFS READ throughput
improves somewhat, but not to the degree it did with yesterday's
patch series.

function_graph shows that a single intel_iommu_map() call is used more
frequently, but the following still happens on occasion:

395.678889: funcgraph_entry:                   |  dma_map_sg_attrs() {
395.678889: funcgraph_entry:                   |    iommu_dma_map_sg() {
395.678890: funcgraph_entry:        0.257 us   |      iommu_get_dma_domain();
395.678890: funcgraph_entry:        0.255 us   |      iommu_dma_deferred_attach();
395.678891: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
395.678891: funcgraph_entry:        0.253 us   |        dev_is_untrusted();
395.678891: funcgraph_exit:         0.773 us   |      }
395.678892: funcgraph_entry:        0.250 us   |      dev_is_untrusted();
395.678893: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
395.678893: funcgraph_entry:                   |        alloc_iova_fast() {
395.678893: funcgraph_entry:        0.255 us   |          _raw_spin_lock_irqsave();
395.678894: funcgraph_entry:        0.375 us   |          __lock_text_start();
395.678894: funcgraph_exit:         1.435 us   |        }
395.678895: funcgraph_exit:         2.002 us   |      }
395.678895: funcgraph_entry:        0.252 us   |      dma_info_to_prot();
395.678895: funcgraph_entry:                   |      iommu_map_sg_atomic() {
395.678896: funcgraph_entry:                   |        __iommu_map_sg() {
395.678896: funcgraph_entry:        1.675 us   |          intel_iommu_map();
395.678898: funcgraph_entry:        1.365 us   |          intel_iommu_map();
395.678900: funcgraph_entry:        1.373 us   |          intel_iommu_map();
395.678901: funcgraph_entry:        1.378 us   |          intel_iommu_map();
395.678903: funcgraph_entry:        1.380 us   |          intel_iommu_map();
395.678905: funcgraph_entry:        1.380 us   |          intel_iommu_map();
395.678906: funcgraph_entry:        1.378 us   |          intel_iommu_map();
395.678908: funcgraph_entry:        1.380 us   |          intel_iommu_map();
395.678910: funcgraph_entry:        1.345 us   |          intel_iommu_map();
395.678911: funcgraph_entry:        1.342 us   |          intel_iommu_map();
395.678913: funcgraph_entry:        1.342 us   |          intel_iommu_map();
395.678915: funcgraph_entry:        1.395 us   |          intel_iommu_map();
395.678916: funcgraph_entry:        1.343 us   |          intel_iommu_map();
395.678918: funcgraph_entry:        1.350 us   |          intel_iommu_map();
395.678920: funcgraph_entry:        1.395 us   |          intel_iommu_map();
395.678921: funcgraph_entry:        1.342 us   |          intel_iommu_map();
395.678923: funcgraph_entry:        1.350 us   |          intel_iommu_map();
395.678924: funcgraph_entry:        1.345 us   |          intel_iommu_map();
395.678926: funcgraph_entry:        1.345 us   |          intel_iommu_map();
395.678928: funcgraph_entry:        1.340 us   |          intel_iommu_map();
395.678929: funcgraph_entry:        1.342 us   |          intel_iommu_map();
395.678931: funcgraph_entry:        1.335 us   |          intel_iommu_map();
395.678933: funcgraph_entry:        1.345 us   |          intel_iommu_map();
395.678934: funcgraph_entry:        1.337 us   |          intel_iommu_map();
395.678936: funcgraph_entry:        1.305 us   |          intel_iommu_map();
395.678938: funcgraph_entry:        1.380 us   |          intel_iommu_map();
395.678939: funcgraph_entry:        1.365 us   |          intel_iommu_map();
395.678941: funcgraph_entry:        1.370 us   |          intel_iommu_map();
395.678943: funcgraph_entry:        1.365 us   |          intel_iommu_map();
395.678945: funcgraph_entry:        1.482 us   |          intel_iommu_map();
395.678946: funcgraph_exit:       + 50.753 us  |        }
395.678947: funcgraph_exit:       + 51.348 us  |      }
395.678947: funcgraph_exit:       + 57.975 us  |    }
395.678948: funcgraph_exit:       + 58.700 us  |  }
395.678953: xprtrdma_mr_map:      task:64127@5 mr.id=104 nents=30 122880@0xc5e467fde9380000:0xc0011103 (TO_DEVICE)
395.678953: xprtrdma_chunk_read:  task:64127@5 pos=148 122880@0xc5e467fde9380000:0xc0011103 (more)
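
For rough scale, the mapped length in that trace works out to one
intel_iommu_map() call per 4KB page:

	122880 bytes / 4096 bytes per page = 30 pages -> 30 intel_iommu_map() calls
	50.753 us in __iommu_map_sg() / 30 calls     ~= 1.7 us per call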


--
Chuck Lever




* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-25 19:31   ` Chuck Lever
  2021-01-26  6:18     ` Lu Baolu
@ 2021-01-26 16:05     ` Robin Murphy
  2021-01-26 17:32       ` Chuck Lever
  1 sibling, 1 reply; 12+ messages in thread
From: Robin Murphy @ 2021-01-26 16:05 UTC (permalink / raw)
  To: Chuck Lever, Lu Baolu; +Cc: Will Deacon, iommu

On 2021-01-25 19:31, Chuck Lever wrote:
> 
> 
>> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> Hello Lu -
>>
>> Many thanks for your prototype.
>>
>>
>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>
>>> This patch series is only for Request-For-Testing purpose. It aims to
>>> fix the performance regression reported here.
>>>
>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>
>>> The first two patches are borrowed from here.
>>>
>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>
>>> Please kindly help to verification.
>>>
>>> Best regards,
>>> baolu
>>>
>>> Lu Baolu (1):
>>> iommu/vt-d: Add iotlb_sync_map callback
>>>
>>> Yong Wu (2):
>>> iommu: Move iotlb_sync_map out from __iommu_map
>>> iommu: Add iova and size as parameters in iotlb_sync_map
>>>
>>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>> drivers/iommu/iommu.c       | 23 +++++++---
>>> drivers/iommu/tegra-gart.c  |  7 ++-
>>> include/linux/iommu.h       |  3 +-
>>> 4 files changed, 83 insertions(+), 36 deletions(-)
>>
>> Here are results with the NFS client at stock v5.11-rc5 and the
>> NFS server at v5.10, showing the regression I reported earlier.
>>
>> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
>> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
>> 	Min throughput per process                      = 373101.59 kB/sec
>> 	Max throughput per process                      = 382669.50 kB/sec
>> 	Avg throughput per process                      = 377881.83 kB/sec
>> 	Min xfer                                        = 1022720.00 kB
>> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
>>
>>
>> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
>> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
>> 	Min throughput per process                      = 374672.00 kB/sec
>> 	Max throughput per process                      = 383983.78 kB/sec
>> 	Avg throughput per process                      = 378500.26 kB/sec
>> 	Min xfer                                        = 1022976.00 kB
>> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
>>
>>
>> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
>> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
>> 	Min throughput per process                      = 376727.56 kB/sec
>> 	Max throughput per process                      = 383783.91 kB/sec
>> 	Avg throughput per process                      = 380719.34 kB/sec
>> 	Min xfer                                        = 1029376.00 kB
>> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
>>
>>
>> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
>> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
>> 	Min throughput per process                      = 381532.78 kB/sec
>> 	Max throughput per process                      = 387072.53 kB/sec
>> 	Avg throughput per process                      = 384225.23 kB/sec
>> 	Min xfer                                        = 1034496.00 kB
>> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
>>
>> Here's the NFS client at v5.11-rc5 with your series applied.
>> The NFS server remains at v5.10:
>>
>> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
>> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
>> 	Min throughput per process                      = 367865.28 kB/sec
>> 	Max throughput per process                      = 371134.38 kB/sec
>> 	Avg throughput per process                      = 369564.90 kB/sec
>> 	Min xfer                                        = 1039360.00 kB
>> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
>>
>>
>> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
>> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
>> 	Min throughput per process                      = 370985.34 kB/sec
>> 	Max throughput per process                      = 374752.28 kB/sec
>> 	Avg throughput per process                      = 373072.56 kB/sec
>> 	Min xfer                                        = 1038592.00 kB
>> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
>>
>>
>> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
>> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
>> 	Min throughput per process                      = 487766.81 kB/sec
>> 	Max throughput per process                      = 489623.88 kB/sec
>> 	Avg throughput per process                      = 488772.41 kB/sec
>> 	Min xfer                                        = 1044736.00 kB
>> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
>>
>>
>> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
>> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
>> 	Min throughput per process                      = 485835.03 kB/sec
>> 	Max throughput per process                      = 488702.12 kB/sec
>> 	Avg throughput per process                      = 487286.55 kB/sec
>> 	Min xfer                                        = 1042688.00 kB
>> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
>>
>> NFS READ throughput is almost fully restored. A normal-looking throughput
>> result, copied from the previous thread, is:
>>
>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>
>> The NFS WRITE throughput result appears to be unchanged, or slightly
>> worse than before. I don't have an explanation for this result. I applied
>> your patches on the NFS server also without noting improvement.
> 
> Function-boundary tracing shows some interesting results.
> 
> # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
> 
> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
> the same size need as many as one __iommu_map() call per SGL element (which
> would be 30 for a 120KB SGL).
> 
> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
> a single call to domain_mapping() and thus always just a single TLB flush.

Implementing .iotlb_sync_map means that a single top-level 
iommu_map()/iommu_map_sg() call should still only invoke a single "TLB 
flush" (really, any maintenance required for the IOMMU to start using 
the new mapping) at the end, regardless of how many internal 
__iommu_map() calls are made to satisfy the overall request. If you're 
seeing something other than that behaviour (with this series), that 
implies we've not got things quite right yet.
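
To make that division of labour concrete, here is a minimal sketch of a
driver wired up that way; the example_* helpers and ops are invented for
illustration and are not taken from any in-tree driver or from this series:

static int example_map(struct iommu_domain *domain, unsigned long iova,
                       phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
        /* install page-table entries only; defer all TLB maintenance */
        return example_install_ptes(domain, iova, paddr, size, prot, gfp);
}

static void example_iotlb_sync_map(struct iommu_domain *domain)
{
        /* one flush covering everything the preceding map call(s) added */
        example_flush_iotlb(domain);
}

static const struct iommu_ops example_ops = {
        .map            = example_map,
        .iotlb_sync_map = example_iotlb_sync_map,
        /* ... other callbacks elided ... */
};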

> My amateur theorizing suggests that the SGL element coalescing done in
> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
> in more calls to domain_mapping(). Not only does that take longer, but it creates
> many more DMA maps. Could that also have some impact on device TLB resources?

FWIW the old __domain_mapping() code just did a dumb iteration over the 
scatterlist segments internally, so __iommu_map_sg() should be no worse 
in that regard, and could in principle even be better if it's able to 
coalesce things far enough to start fitting large page mappings. The 
only appreciable difference *should* be any additional self-time in 
__iommu_map() due to the iteration now being performed one layer up.
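
For reference, the heart of that coalescing is roughly the following
simplified paraphrase of the loop in __iommu_map_sg() (not the literal
upstream code; error unwinding is omitted):

static size_t coalesce_and_map_sketch(struct iommu_domain *domain,
                                      unsigned long iova,
                                      struct scatterlist *sg,
                                      unsigned int nents, int prot, gfp_t gfp)
{
        phys_addr_t start = 0;
        size_t len = 0, mapped = 0;
        unsigned int i;

        for (i = 0; i < nents; i++, sg = sg_next(sg)) {
                phys_addr_t s_phys = sg_phys(sg);

                /* a physical discontiguity ends the current run */
                if (len && s_phys != start + len) {
                        if (__iommu_map(domain, iova + mapped, start, len,
                                        prot, gfp))
                                return 0;
                        mapped += len;
                        len = 0;
                }
                if (!len)
                        start = s_phys;
                len += sg->length;
        }
        /* map whatever remains in the final run */
        if (len && __iommu_map(domain, iova + mapped, start, len, prot, gfp))
                return 0;

        return mapped + len;
}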

Is there any significant difference between how the NFS read and write 
paths make their DMA API calls and/or get their scatterlists in the 
first place, that might help shed some light on the curious 
half-recovery you got?

Thanks,
Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-26 16:05     ` Robin Murphy
@ 2021-01-26 17:32       ` Chuck Lever
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2021-01-26 17:32 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Will Deacon, iommu

Hi Robin-

> On Jan 26, 2021, at 11:05 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> 
> Implementing .iotlb_sync_map means that a single top-level iommu_map()/iommu_map_sg() call should still only invoke a single "TLB flush" (really, any maintenance required for the IOMMU to start using the new mapping) at the end, regardless of how many internal __iommu_map() calls are made to satisfy the overall request. If you're seeing something other than that behaviour (with this series), that implies we've not got things quite right yet.


The flush is expensive, but it's not the only cost. DMA-mapping a 120KB
SGL in a single domain_mapping() call vs. 30 calls is certainly going to
be a detectable difference.

Naively speaking, if there are more DMA mappings to keep track of because
the IOMMU driver isn't coalescing the SGLs the way it did before, that
might trigger TLB thrashing on the NIC.


> Is there any significant difference between how the NFS read and write paths make their DMA API calls and/or get their scatterlists in the first place, that might help shed some light on the curious half-recovery you got?

There isn't a difference in the RPC-over-RDMA code. Client-side DMA mapping
is handled in net/sunrpc/xprtrdma/frwr_ops.c :: frwr_map(), which is used
for both I/O directions.

On the server, the RDMA core r/w API is used for mapping and then posting
RDMA Read and Write work requests. That API appears in
drivers/infiniband/core/rw.c, and as far as I understand, the same mapping
functions are used for both I/O directions.

It's possible that the NIC is doing something different for RDMA Read and
RDMA Write, but I don't have much visibility into that. Reads are very
different from Writes, which are posted.
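
Roughly speaking, both directions end up funnelling the scatterlist through
the same DMA API entry point, with only the direction argument differing.
As an illustrative sketch (not copied from frwr_ops.c or rw.c, and with a
made-up "writing" flag):

static int map_rdma_sgl_sketch(struct ib_device *dev, struct scatterlist *sgl,
                               int nents, bool writing)
{
        enum dma_data_direction dir = writing ? DMA_TO_DEVICE : DMA_FROM_DEVICE;

        /* returns the number of mapped entries, or 0 on failure */
        return ib_dma_map_sg(dev, sgl, nents, dir);
}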


--
Chuck Lever



_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-26 15:52       ` Chuck Lever
@ 2021-01-27  1:53         ` Lu Baolu
  2021-01-27  2:58           ` Chuck Lever
  0 siblings, 1 reply; 12+ messages in thread
From: Lu Baolu @ 2021-01-27  1:53 UTC (permalink / raw)
  To: Chuck Lever; +Cc: isaacm, Robin Murphy, iommu, Will Deacon

Hi Chuck,

On 1/26/21 11:52 PM, Chuck Lever wrote:
> 
> 
>> On Jan 26, 2021, at 1:18 AM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>
>> On 2021/1/26 3:31, Chuck Lever wrote:
>>>> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> Hello Lu -
>>>>
>>>> Many thanks for your prototype.
>>>>
>>>>
>>>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>>>
>>>>> This patch series is only for Request-For-Testing purpose. It aims to
>>>>> fix the performance regression reported here.
>>>>>
>>>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>>>
>>>>> The first two patches are borrowed from here.
>>>>>
>>>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>>>
>>>>> Please kindly help to verification.
>>>>>
>>>>> Best regards,
>>>>> baolu
>>>>>
>>>>> Lu Baolu (1):
>>>>> iommu/vt-d: Add iotlb_sync_map callback
>>>>>
>>>>> Yong Wu (2):
>>>>> iommu: Move iotlb_sync_map out from __iommu_map
>>>>> iommu: Add iova and size as parameters in iotlb_sync_map
>>>>>
>>>>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>>>> drivers/iommu/iommu.c       | 23 +++++++---
>>>>> drivers/iommu/tegra-gart.c  |  7 ++-
>>>>> include/linux/iommu.h       |  3 +-
>>>>> 4 files changed, 83 insertions(+), 36 deletions(-)
>>>>
>>>> Here are results with the NFS client at stock v5.11-rc5 and the
>>>> NFS server at v5.10, showing the regression I reported earlier.
>>>>
>>>> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
>>>> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
>>>> 	Min throughput per process                      = 373101.59 kB/sec
>>>> 	Max throughput per process                      = 382669.50 kB/sec
>>>> 	Avg throughput per process                      = 377881.83 kB/sec
>>>> 	Min xfer                                        = 1022720.00 kB
>>>> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
>>>>
>>>>
>>>> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
>>>> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
>>>> 	Min throughput per process                      = 374672.00 kB/sec
>>>> 	Max throughput per process                      = 383983.78 kB/sec
>>>> 	Avg throughput per process                      = 378500.26 kB/sec
>>>> 	Min xfer                                        = 1022976.00 kB
>>>> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
>>>>
>>>>
>>>> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
>>>> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
>>>> 	Min throughput per process                      = 376727.56 kB/sec
>>>> 	Max throughput per process                      = 383783.91 kB/sec
>>>> 	Avg throughput per process                      = 380719.34 kB/sec
>>>> 	Min xfer                                        = 1029376.00 kB
>>>> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
>>>>
>>>>
>>>> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
>>>> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
>>>> 	Min throughput per process                      = 381532.78 kB/sec
>>>> 	Max throughput per process                      = 387072.53 kB/sec
>>>> 	Avg throughput per process                      = 384225.23 kB/sec
>>>> 	Min xfer                                        = 1034496.00 kB
>>>> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
>>>>
>>>> Here's the NFS client at v5.11-rc5 with your series applied.
>>>> The NFS server remains at v5.10:
>>>>
>>>> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
>>>> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
>>>> 	Min throughput per process                      = 367865.28 kB/sec
>>>> 	Max throughput per process                      = 371134.38 kB/sec
>>>> 	Avg throughput per process                      = 369564.90 kB/sec
>>>> 	Min xfer                                        = 1039360.00 kB
>>>> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
>>>>
>>>>
>>>> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
>>>> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
>>>> 	Min throughput per process                      = 370985.34 kB/sec
>>>> 	Max throughput per process                      = 374752.28 kB/sec
>>>> 	Avg throughput per process                      = 373072.56 kB/sec
>>>> 	Min xfer                                        = 1038592.00 kB
>>>> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
>>>>
>>>>
>>>> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
>>>> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
>>>> 	Min throughput per process                      = 487766.81 kB/sec
>>>> 	Max throughput per process                      = 489623.88 kB/sec
>>>> 	Avg throughput per process                      = 488772.41 kB/sec
>>>> 	Min xfer                                        = 1044736.00 kB
>>>> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
>>>>
>>>>
>>>> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
>>>> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
>>>> 	Min throughput per process                      = 485835.03 kB/sec
>>>> 	Max throughput per process                      = 488702.12 kB/sec
>>>> 	Avg throughput per process                      = 487286.55 kB/sec
>>>> 	Min xfer                                        = 1042688.00 kB
>>>> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
>>>>
>>>> NFS READ throughput is almost fully restored. A normal-looking throughput
>>>> result, copied from the previous thread, is:
>>>>
>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>
>>>> The NFS WRITE throughput result appears to be unchanged, or slightly
>>>> worse than before. I don't have an explanation for this result. I applied
>>>> your patches on the NFS server also without noting improvement.
>>> Function-boundary tracing shows some interesting results.
>>> # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
>>> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
>>> the same size need as many as one __iommu_map() call per SGL element (which
>>> would be 30 for a 120KB SGL).
>>> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
>>> a single call to domain_mapping() and thus always just a single TLB flush.
>>> My amateur theorizing suggests that the SGL element coalescing done in
>>> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
>>> in more calls to domain_mapping(). Not only does that take longer, but it creates
>>> many more DMA maps. Could that also have some impact on device TLB resources?
>>
>> It seems that the additional domain_mapping() calls are caused not by
>> __iommu_map_sg() but by __iommu_map().
>>
>> Can you please test the changes below? They call intel_iommu_map()
>> directly instead of __iommu_map().
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index f5a236e63ded..660d5744a117 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -4916,7 +4916,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
>> }
>> #endif
>>
>> -static int intel_iommu_map(struct iommu_domain *domain,
>> +int intel_iommu_map(struct iommu_domain *domain,
>>                            unsigned long iova, phys_addr_t hpa,
>>                            size_t size, int iommu_prot, gfp_t gfp)
>> {
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 3d099a31ddca..a1b41fd3fb4e 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -23,8 +23,13 @@
>> #include <linux/property.h>
>> #include <linux/fsl/mc.h>
>> #include <linux/module.h>
>> +#include <linux/intel-iommu.h>
>> #include <trace/events/iommu.h>
>>
>> +extern int intel_iommu_map(struct iommu_domain *domain,
>> +                          unsigned long iova, phys_addr_t hpa,
>> +                          size_t size, int iommu_prot, gfp_t gfp);
>> +
>> static struct kset *iommu_group_kset;
>> static DEFINE_IDA(iommu_group_ida);
>>
>> @@ -2553,8 +2558,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
>>                 phys_addr_t s_phys = sg_phys(sg);
>>
>>                 if (len && s_phys != start + len) {
>> -                       ret = __iommu_map(domain, iova + mapped, start,
>> -                                       len, prot, gfp);
>> +                       ret = intel_iommu_map(domain, iova + mapped, start, len, prot, gfp);
>>
>>                         if (ret)
>>                                 goto out_err;
>>
>> Does it change anything?
> 
> I removed yesterday's 3-patch series, and applied the above.
> Not a full restoration, but interesting nonetheless.

Do you mind having a try with all 4 patches?

> 
> 	Children see throughput for 12 initial writers 	= 4852657.22 kB/sec
> 	Parent sees throughput for 12 initial writers 	= 4826730.38 kB/sec
> 	Min throughput per process 			=  403196.09 kB/sec
> 	Max throughput per process 			=  406071.47 kB/sec
> 	Avg throughput per process 			=  404388.10 kB/sec
> 	Min xfer 					= 1041408.00 kB
> 	CPU Utilization: Wall time    2.596    CPU time    2.047    CPU utilization  78.84 %
> 
> 
> 	Children see throughput for 12 rewriters 	= 4853900.22 kB/sec
> 	Parent sees throughput for 12 rewriters 	= 4848623.31 kB/sec
> 	Min throughput per process 			=  403380.81 kB/sec
> 	Max throughput per process 			=  405478.53 kB/sec
> 	Avg throughput per process 			=  404491.68 kB/sec
> 	Min xfer 					= 1042944.00 kB
> 	CPU utilization: Wall time    2.589    CPU time    2.048    CPU utilization  79.12 %
> 
> 
> 	Children see throughput for 12 readers 		= 4938124.12 kB/sec
> 	Parent sees throughput for 12 readers 		= 4932862.08 kB/sec
> 	Min throughput per process 			=  408768.81 kB/sec
> 	Max throughput per process 			=  413879.25 kB/sec
> 	Avg throughput per process 			=  411510.34 kB/sec
> 	Min xfer 					= 1036800.00 kB
> 	CPU utilization: Wall time    2.536    CPU time    1.906    CPU utilization  75.16 %
> 
> 
> 	Children see throughput for 12 re-readers 	= 4992115.16 kB/sec
> 	Parent sees throughput for 12 re-readers 	= 4986102.07 kB/sec
> 	Min throughput per process 			=  411103.00 kB/sec
> 	Max throughput per process 			=  420302.97 kB/sec
> 	Avg throughput per process 			=  416009.60 kB/sec
> 	Min xfer 					= 1025792.00 kB
> 	CPU utilization: Wall time    2.497    CPU time    1.887    CPU utilization  75.57 %
> 
> NFS WRITE throughput improves significantly. NFS READ throughput
> improves somewhat, but not to the degree it did with yesterday's
> patch series.
> 
> function_graph shows a single intel_iommu_map() is used more
> frequently, but the following happens on occasion:

This is because the pages are not physically contiguous, so
__iommu_map_sg() can't coalesce them.
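
Put another way, a run of segments only keeps growing while each new
segment starts exactly where the previous one ended; a sketch of just that
condition (illustrative helper, same check as in __iommu_map_sg()):

static bool can_extend_run(phys_addr_t run_start, size_t run_len,
                           phys_addr_t next_seg_phys)
{
        return run_len == 0 || next_seg_phys == run_start + run_len;
}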

If there's no full restoration with all four patches, it could be due to
ping-ponging between the iommu core and the vendor iommu driver through
indirect calls. Hence a possible solution is to add an
iommu_ops->map_sg() callback so that all scatter-gather elements can be
handled in a single function, as before. This has already been proposed
by Isaac J. Manjarres.

https://lore.kernel.org/linux-iommu/1610376862-927-5-git-send-email-isaacm@codeaurora.org/
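
The rough shape of that hook would be an extra iommu_ops callback along
these lines (illustrative signature only; the one in the posted series may
differ):

/* in struct iommu_ops: let the driver walk the whole scatterlist itself
 * and perform any required TLB maintenance once at the end */
size_t (*map_sg)(struct iommu_domain *domain, unsigned long iova,
                 struct scatterlist *sg, unsigned int nents,
                 int prot, gfp_t gfp);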

Best regards,
baolu

> 
> 395.678889: funcgraph_entry:                   |  dma_map_sg_attrs() {
> 395.678889: funcgraph_entry:                   |    iommu_dma_map_sg() {
> 395.678890: funcgraph_entry:        0.257 us   |      iommu_get_dma_domain();
> 395.678890: funcgraph_entry:        0.255 us   |      iommu_dma_deferred_attach();
> 395.678891: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
> 395.678891: funcgraph_entry:        0.253 us   |        dev_is_untrusted();
> 395.678891: funcgraph_exit:         0.773 us   |      }
> 395.678892: funcgraph_entry:        0.250 us   |      dev_is_untrusted();
> 395.678893: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
> 395.678893: funcgraph_entry:                   |        alloc_iova_fast() {
> 395.678893: funcgraph_entry:        0.255 us   |          _raw_spin_lock_irqsave();
> 395.678894: funcgraph_entry:        0.375 us   |          __lock_text_start();
> 395.678894: funcgraph_exit:         1.435 us   |        }
> 395.678895: funcgraph_exit:         2.002 us   |      }
> 395.678895: funcgraph_entry:        0.252 us   |      dma_info_to_prot();
> 395.678895: funcgraph_entry:                   |      iommu_map_sg_atomic() {
> 395.678896: funcgraph_entry:                   |        __iommu_map_sg() {
> 395.678896: funcgraph_entry:        1.675 us   |          intel_iommu_map();
> 395.678898: funcgraph_entry:        1.365 us   |          intel_iommu_map();
> 395.678900: funcgraph_entry:        1.373 us   |          intel_iommu_map();
> 395.678901: funcgraph_entry:        1.378 us   |          intel_iommu_map();
> 395.678903: funcgraph_entry:        1.380 us   |          intel_iommu_map();
> 395.678905: funcgraph_entry:        1.380 us   |          intel_iommu_map();
> 395.678906: funcgraph_entry:        1.378 us   |          intel_iommu_map();
> 395.678908: funcgraph_entry:        1.380 us   |          intel_iommu_map();
> 395.678910: funcgraph_entry:        1.345 us   |          intel_iommu_map();
> 395.678911: funcgraph_entry:        1.342 us   |          intel_iommu_map();
> 395.678913: funcgraph_entry:        1.342 us   |          intel_iommu_map();
> 395.678915: funcgraph_entry:        1.395 us   |          intel_iommu_map();
> 395.678916: funcgraph_entry:        1.343 us   |          intel_iommu_map();
> 395.678918: funcgraph_entry:        1.350 us   |          intel_iommu_map();
> 395.678920: funcgraph_entry:        1.395 us   |          intel_iommu_map();
> 395.678921: funcgraph_entry:        1.342 us   |          intel_iommu_map();
> 395.678923: funcgraph_entry:        1.350 us   |          intel_iommu_map();
> 395.678924: funcgraph_entry:        1.345 us   |          intel_iommu_map();
> 395.678926: funcgraph_entry:        1.345 us   |          intel_iommu_map();
> 395.678928: funcgraph_entry:        1.340 us   |          intel_iommu_map();
> 395.678929: funcgraph_entry:        1.342 us   |          intel_iommu_map();
> 395.678931: funcgraph_entry:        1.335 us   |          intel_iommu_map();
> 395.678933: funcgraph_entry:        1.345 us   |          intel_iommu_map();
> 395.678934: funcgraph_entry:        1.337 us   |          intel_iommu_map();
> 395.678936: funcgraph_entry:        1.305 us   |          intel_iommu_map();
> 395.678938: funcgraph_entry:        1.380 us   |          intel_iommu_map();
> 395.678939: funcgraph_entry:        1.365 us   |          intel_iommu_map();
> 395.678941: funcgraph_entry:        1.370 us   |          intel_iommu_map();
> 395.678943: funcgraph_entry:        1.365 us   |          intel_iommu_map();
> 395.678945: funcgraph_entry:        1.482 us   |          intel_iommu_map();
> 395.678946: funcgraph_exit:       + 50.753 us  |        }
> 395.678947: funcgraph_exit:       + 51.348 us  |      }
> 395.678947: funcgraph_exit:       + 57.975 us  |    }
> 395.678948: funcgraph_exit:       + 58.700 us  |  }
> 395.678953: xprtrdma_mr_map:      task:64127@5 mr.id=104 nents=30 122880@0xc5e467fde9380000:0xc0011103 (TO_DEVICE)
> 395.678953: xprtrdma_chunk_read:  task:64127@5 pos=148 122880@0xc5e467fde9380000:0xc0011103 (more)
> 
> 
> --
> Chuck Lever
> 
> 
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
  2021-01-27  1:53         ` Lu Baolu
@ 2021-01-27  2:58           ` Chuck Lever
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2021-01-27  2:58 UTC (permalink / raw)
  To: Lu Baolu; +Cc: isaacm, Robin Murphy, iommu, Will Deacon



> On Jan 26, 2021, at 8:53 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> Hi Chuck,
> 
> On 1/26/21 11:52 PM, Chuck Lever wrote:
>>> On Jan 26, 2021, at 1:18 AM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> 
>>> On 2021/1/26 3:31, Chuck Lever wrote:
>>>>> On Jan 25, 2021, at 12:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> Hello Lu -
>>>>> 
>>>>> Many thanks for your prototype.
>>>>> 
>>>>> 
>>>>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>>>> 
>>>>>> This patch series is only for Request-For-Testing purpose. It aims to
>>>>>> fix the performance regression reported here.
>>>>>> 
>>>>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>>>> 
>>>>>> The first two patches are borrowed from here.
>>>>>> 
>>>>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>>>> 
>>>>>> Please kindly help to verification.
>>>>>> 
>>>>>> Best regards,
>>>>>> baolu
>>>>>> 
>>>>>> Lu Baolu (1):
>>>>>> iommu/vt-d: Add iotlb_sync_map callback
>>>>>> 
>>>>>> Yong Wu (2):
>>>>>> iommu: Move iotlb_sync_map out from __iommu_map
>>>>>> iommu: Add iova and size as parameters in iotlb_sync_map
>>>>>> 
>>>>>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>>>>> drivers/iommu/iommu.c       | 23 +++++++---
>>>>>> drivers/iommu/tegra-gart.c  |  7 ++-
>>>>>> include/linux/iommu.h       |  3 +-
>>>>>> 4 files changed, 83 insertions(+), 36 deletions(-)
>>>>> 
>>>>> Here are results with the NFS client at stock v5.11-rc5 and the
>>>>> NFS server at v5.10, showing the regression I reported earlier.
>>>>> 
>>>>> 	Children see throughput for 12 initial writers  = 4534582.00 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers   = 4458145.56 kB/sec
>>>>> 	Min throughput per process                      = 373101.59 kB/sec
>>>>> 	Max throughput per process                      = 382669.50 kB/sec
>>>>> 	Avg throughput per process                      = 377881.83 kB/sec
>>>>> 	Min xfer                                        = 1022720.00 kB
>>>>> 	CPU Utilization: Wall time    2.787    CPU time    1.922    CPU utilization  68.95 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 rewriters        = 4542003.12 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters         = 4538024.19 kB/sec
>>>>> 	Min throughput per process                      = 374672.00 kB/sec
>>>>> 	Max throughput per process                      = 383983.78 kB/sec
>>>>> 	Avg throughput per process                      = 378500.26 kB/sec
>>>>> 	Min xfer                                        = 1022976.00 kB
>>>>> 	CPU utilization: Wall time    2.733    CPU time    1.947    CPU utilization  71.25 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 readers          = 4568632.03 kB/sec
>>>>> 	Parent sees throughput for 12 readers           = 4563672.02 kB/sec
>>>>> 	Min throughput per process                      = 376727.56 kB/sec
>>>>> 	Max throughput per process                      = 383783.91 kB/sec
>>>>> 	Avg throughput per process                      = 380719.34 kB/sec
>>>>> 	Min xfer                                        = 1029376.00 kB
>>>>> 	CPU utilization: Wall time    2.733    CPU time    1.898    CPU utilization  69.46 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 re-readers       = 4610702.78 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers        = 4606135.66 kB/sec
>>>>> 	Min throughput per process                      = 381532.78 kB/sec
>>>>> 	Max throughput per process                      = 387072.53 kB/sec
>>>>> 	Avg throughput per process                      = 384225.23 kB/sec
>>>>> 	Min xfer                                        = 1034496.00 kB
>>>>> 	CPU utilization: Wall time    2.711    CPU time    1.910    CPU utilization  70.45 %
>>>>> 
>>>>> Here's the NFS client at v5.11-rc5 with your series applied.
>>>>> The NFS server remains at v5.10:
>>>>> 
>>>>> 	Children see throughput for 12 initial writers  = 4434778.81 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers   = 4408190.69 kB/sec
>>>>> 	Min throughput per process                      = 367865.28 kB/sec
>>>>> 	Max throughput per process                      = 371134.38 kB/sec
>>>>> 	Avg throughput per process                      = 369564.90 kB/sec
>>>>> 	Min xfer                                        = 1039360.00 kB
>>>>> 	CPU Utilization: Wall time    2.842    CPU time    1.904    CPU utilization  66.99 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 rewriters        = 4476870.69 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters         = 4471701.48 kB/sec
>>>>> 	Min throughput per process                      = 370985.34 kB/sec
>>>>> 	Max throughput per process                      = 374752.28 kB/sec
>>>>> 	Avg throughput per process                      = 373072.56 kB/sec
>>>>> 	Min xfer                                        = 1038592.00 kB
>>>>> 	CPU utilization: Wall time    2.801    CPU time    1.902    CPU utilization  67.91 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 readers          = 5865268.88 kB/sec
>>>>> 	Parent sees throughput for 12 readers           = 5854519.73 kB/sec
>>>>> 	Min throughput per process                      = 487766.81 kB/sec
>>>>> 	Max throughput per process                      = 489623.88 kB/sec
>>>>> 	Avg throughput per process                      = 488772.41 kB/sec
>>>>> 	Min xfer                                        = 1044736.00 kB
>>>>> 	CPU utilization: Wall time    2.144    CPU time    1.895    CPU utilization  88.41 %
>>>>> 
>>>>> 
>>>>> 	Children see throughput for 12 re-readers       = 5847438.62 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers        = 5839292.18 kB/sec
>>>>> 	Min throughput per process                      = 485835.03 kB/sec
>>>>> 	Max throughput per process                      = 488702.12 kB/sec
>>>>> 	Avg throughput per process                      = 487286.55 kB/sec
>>>>> 	Min xfer                                        = 1042688.00 kB
>>>>> 	CPU utilization: Wall time    2.148    CPU time    1.909    CPU utilization  88.84 %
>>>>> 
>>>>> NFS READ throughput is almost fully restored. A normal-looking throughput
>>>>> result, copied from the previous thread, is:
>>>>> 
>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>> 
>>>>> The NFS WRITE throughput result appears to be unchanged, or slightly
>>>>> worse than before. I don't have an explanation for this result. I applied
>>>>> your patches on the NFS server also without noting improvement.
>>>> Function-boundary tracing shows some interesting results.
>>>> # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
>>>> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
>>>> the same size need as many as one __iommu_map() call per SGL element (which
>>>> would be 30 for a 120KB SGL).
>>>> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
>>>> a single call to domain_mapping() and thus always just a single TLB flush.
>>>> My amateur theorizing suggests that the SGL element coalescing done in
>>>> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
>>>> in more calls to domain_mapping(). Not only does that take longer, but it creates
>>>> many more DMA maps. Could that also have some impact on device TLB resources?
>>> 
>>> It seems that the additional domain_mapping() calls are caused not by
>>> __iommu_map_sg() but by __iommu_map().
>>> 
>>> Can you please test the changes below? They call intel_iommu_map()
>>> directly instead of __iommu_map().
>>> 
>>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>>> index f5a236e63ded..660d5744a117 100644
>>> --- a/drivers/iommu/intel/iommu.c
>>> +++ b/drivers/iommu/intel/iommu.c
>>> @@ -4916,7 +4916,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
>>> }
>>> #endif
>>> 
>>> -static int intel_iommu_map(struct iommu_domain *domain,
>>> +int intel_iommu_map(struct iommu_domain *domain,
>>>                           unsigned long iova, phys_addr_t hpa,
>>>                           size_t size, int iommu_prot, gfp_t gfp)
>>> {
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 3d099a31ddca..a1b41fd3fb4e 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -23,8 +23,13 @@
>>> #include <linux/property.h>
>>> #include <linux/fsl/mc.h>
>>> #include <linux/module.h>
>>> +#include <linux/intel-iommu.h>
>>> #include <trace/events/iommu.h>
>>> 
>>> +extern int intel_iommu_map(struct iommu_domain *domain,
>>> +                          unsigned long iova, phys_addr_t hpa,
>>> +                          size_t size, int iommu_prot, gfp_t gfp);
>>> +
>>> static struct kset *iommu_group_kset;
>>> static DEFINE_IDA(iommu_group_ida);
>>> 
>>> @@ -2553,8 +2558,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
>>>                phys_addr_t s_phys = sg_phys(sg);
>>> 
>>>                if (len && s_phys != start + len) {
>>> -                       ret = __iommu_map(domain, iova + mapped, start,
>>> -                                       len, prot, gfp);
>>> +                       ret = intel_iommu_map(domain, iova + mapped, start, len, prot, gfp);
>>> 
>>>                        if (ret)
>>>                                goto out_err;
>>> 
>>> Does it change anything?
>> I removed yesterday's 3-patch series, and applied the above.
>> Not a full restoration, but interesting nonetheless.
> 
> Do you mind having a try with all 4 patches?

Much worse. I checked this result several times. I still need to go
look more closely at the patches to see if I merged them together
sensibly.

	Children see throughput for 12 initial writers  = 3890646.38 kB/sec
	Parent sees throughput for 12 initial writers   = 3857201.57 kB/sec
	Min throughput per process                      = 322068.31 kB/sec 
	Max throughput per process                      = 326518.72 kB/sec
	Avg throughput per process                      = 324220.53 kB/sec
	Min xfer                                        = 1034240.00 kB
	CPU Utilization: Wall time    3.239    CPU time    1.959    CPU utilization  60.49 %


	Children see throughput for 12 rewriters        = 3571855.25 kB/sec
	Parent sees throughput for 12 rewriters         = 3568989.90 kB/sec
	Min throughput per process                      = 295413.09 kB/sec 
	Max throughput per process                      = 300169.31 kB/sec
	Avg throughput per process                      = 297654.60 kB/sec
	Min xfer                                        = 1032192.00 kB
	CPU utilization: Wall time    3.495    CPU time    2.017    CPU utilization  57.70 %


	Children see throughput for 12 readers          = 3485583.19 kB/sec
	Parent sees throughput for 12 readers           = 3482811.89 kB/sec
	Min throughput per process                      = 287489.16 kB/sec 
	Max throughput per process                      = 294395.06 kB/sec
	Avg throughput per process                      = 290465.27 kB/sec
	Min xfer                                        = 1024256.00 kB
	CPU utilization: Wall time    3.565    CPU time    1.984    CPU utilization  55.64 %


	Children see throughput for 12 re-readers       = 3427079.94 kB/sec
	Parent sees throughput for 12 re-readers        = 3424836.93 kB/sec
	Min throughput per process                      = 283702.56 kB/sec 
	Max throughput per process                      = 288727.31 kB/sec
	Avg throughput per process                      = 285589.99 kB/sec
	Min xfer                                        = 1030400.00 kB
	CPU utilization: Wall time    3.634    CPU time    1.999    CPU utilization  55.02 %


>> 	Children see throughput for 12 initial writers 	= 4852657.22 kB/sec
>> 	Parent sees throughput for 12 initial writers 	= 4826730.38 kB/sec
>> 	Min throughput per process 			=  403196.09 kB/sec
>> 	Max throughput per process 			=  406071.47 kB/sec
>> 	Avg throughput per process 			=  404388.10 kB/sec
>> 	Min xfer 					= 1041408.00 kB
>> 	CPU Utilization: Wall time    2.596    CPU time    2.047    CPU utilization  78.84 %
>> 	Children see throughput for 12 rewriters 	= 4853900.22 kB/sec
>> 	Parent sees throughput for 12 rewriters 	= 4848623.31 kB/sec
>> 	Min throughput per process 			=  403380.81 kB/sec
>> 	Max throughput per process 			=  405478.53 kB/sec
>> 	Avg throughput per process 			=  404491.68 kB/sec
>> 	Min xfer 					= 1042944.00 kB
>> 	CPU utilization: Wall time    2.589    CPU time    2.048    CPU utilization  79.12 %
>> 	Children see throughput for 12 readers 		= 4938124.12 kB/sec
>> 	Parent sees throughput for 12 readers 		= 4932862.08 kB/sec
>> 	Min throughput per process 			=  408768.81 kB/sec
>> 	Max throughput per process 			=  413879.25 kB/sec
>> 	Avg throughput per process 			=  411510.34 kB/sec
>> 	Min xfer 					= 1036800.00 kB
>> 	CPU utilization: Wall time    2.536    CPU time    1.906    CPU utilization  75.16 %
>> 	Children see throughput for 12 re-readers 	= 4992115.16 kB/sec
>> 	Parent sees throughput for 12 re-readers 	= 4986102.07 kB/sec
>> 	Min throughput per process 			=  411103.00 kB/sec
>> 	Max throughput per process 			=  420302.97 kB/sec
>> 	Avg throughput per process 			=  416009.60 kB/sec
>> 	Min xfer 					= 1025792.00 kB
>> 	CPU utilization: Wall time    2.497    CPU time    1.887    CPU utilization  75.57 %
>> NFS WRITE throughput improves significantly. NFS READ throughput
>> improves somewhat, but not to the degree it did with yesterday's
>> patch series.
>> function_graph shows a single intel_iommu_map() is used more
>> frequently, but the following happens on occasion:
> 
> This is because the pages are not physically contiguous, so
> __iommu_map_sg() can't coalesce them.

Understood, but why was physical contiguity not a problem for the
intel_map_sg() function removed by commit c588072bba6b ("iommu/vt-d:
Convert intel iommu driver to the iommu ops")? That version of the
function did no such checking, and simply invoked domain_mapping()
once for the entire SGL. Was that not safe to do?

(Feel free to tell me where I'm wrong; I'm really quite a neophyte
with this level of IOMMU detail.)
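
For clarity, the shape I mean is something like the following, with made-up
helper names standing in for whatever installs the PTEs and flushes (this
is not the actual v5.10 code):

static int whole_sgl_map_sketch(struct dmar_domain *domain,
                                unsigned long iov_pfn, struct scatterlist *sg,
                                unsigned long nr_pages, int prot)
{
        int ret;

        /* the driver walks the scatterlist segments internally ... */
        ret = install_ptes_for_sgl(domain, iov_pfn, sg, nr_pages, prot);
        if (ret)
                return ret;

        /* ... and performs a single flush at the end */
        flush_iotlb_once(domain);
        return 0;
}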


> If there's no full restoration with all four patches, it could be due to
> ping-ponging between the iommu core and the vendor iommu driver through
> indirect calls. Hence a possible solution is to add an
> iommu_ops->map_sg() callback so that all scatter-gather elements can be
> handled in a single function, as before. This has already been proposed
> by Isaac J. Manjarres.
> 
> https://lore.kernel.org/linux-iommu/1610376862-927-5-git-send-email-isaacm@codeaurora.org/

That was one of the first things I tried. However, Isaac's patches
don't provide an Intel version of .map_sg, so I made one up based on
__iommu_map_sg, which basically copied the physical-contiguity checking
mentioned above. It didn't perform very well.

I could try again with a .map_sg that works like the old intel_map_sg()
function, perhaps.


> Best regards,
> baolu
> 
>> 395.678889: funcgraph_entry:                   |  dma_map_sg_attrs() {
>> 395.678889: funcgraph_entry:                   |    iommu_dma_map_sg() {
>> 395.678890: funcgraph_entry:        0.257 us   |      iommu_get_dma_domain();
>> 395.678890: funcgraph_entry:        0.255 us   |      iommu_dma_deferred_attach();
>> 395.678891: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>> 395.678891: funcgraph_entry:        0.253 us   |        dev_is_untrusted();
>> 395.678891: funcgraph_exit:         0.773 us   |      }
>> 395.678892: funcgraph_entry:        0.250 us   |      dev_is_untrusted();
>> 395.678893: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>> 395.678893: funcgraph_entry:                   |        alloc_iova_fast() {
>> 395.678893: funcgraph_entry:        0.255 us   |          _raw_spin_lock_irqsave();
>> 395.678894: funcgraph_entry:        0.375 us   |          __lock_text_start();
>> 395.678894: funcgraph_exit:         1.435 us   |        }
>> 395.678895: funcgraph_exit:         2.002 us   |      }
>> 395.678895: funcgraph_entry:        0.252 us   |      dma_info_to_prot();
>> 395.678895: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>> 395.678896: funcgraph_entry:                   |        __iommu_map_sg() {
>> 395.678896: funcgraph_entry:        1.675 us   |          intel_iommu_map();
>> 395.678898: funcgraph_entry:        1.365 us   |          intel_iommu_map();
>> 395.678900: funcgraph_entry:        1.373 us   |          intel_iommu_map();
>> 395.678901: funcgraph_entry:        1.378 us   |          intel_iommu_map();
>> 395.678903: funcgraph_entry:        1.380 us   |          intel_iommu_map();
>> 395.678905: funcgraph_entry:        1.380 us   |          intel_iommu_map();
>> 395.678906: funcgraph_entry:        1.378 us   |          intel_iommu_map();
>> 395.678908: funcgraph_entry:        1.380 us   |          intel_iommu_map();
>> 395.678910: funcgraph_entry:        1.345 us   |          intel_iommu_map();
>> 395.678911: funcgraph_entry:        1.342 us   |          intel_iommu_map();
>> 395.678913: funcgraph_entry:        1.342 us   |          intel_iommu_map();
>> 395.678915: funcgraph_entry:        1.395 us   |          intel_iommu_map();
>> 395.678916: funcgraph_entry:        1.343 us   |          intel_iommu_map();
>> 395.678918: funcgraph_entry:        1.350 us   |          intel_iommu_map();
>> 395.678920: funcgraph_entry:        1.395 us   |          intel_iommu_map();
>> 395.678921: funcgraph_entry:        1.342 us   |          intel_iommu_map();
>> 395.678923: funcgraph_entry:        1.350 us   |          intel_iommu_map();
>> 395.678924: funcgraph_entry:        1.345 us   |          intel_iommu_map();
>> 395.678926: funcgraph_entry:        1.345 us   |          intel_iommu_map();
>> 395.678928: funcgraph_entry:        1.340 us   |          intel_iommu_map();
>> 395.678929: funcgraph_entry:        1.342 us   |          intel_iommu_map();
>> 395.678931: funcgraph_entry:        1.335 us   |          intel_iommu_map();
>> 395.678933: funcgraph_entry:        1.345 us   |          intel_iommu_map();
>> 395.678934: funcgraph_entry:        1.337 us   |          intel_iommu_map();
>> 395.678936: funcgraph_entry:        1.305 us   |          intel_iommu_map();
>> 395.678938: funcgraph_entry:        1.380 us   |          intel_iommu_map();
>> 395.678939: funcgraph_entry:        1.365 us   |          intel_iommu_map();
>> 395.678941: funcgraph_entry:        1.370 us   |          intel_iommu_map();
>> 395.678943: funcgraph_entry:        1.365 us   |          intel_iommu_map();
>> 395.678945: funcgraph_entry:        1.482 us   |          intel_iommu_map();
>> 395.678946: funcgraph_exit:       + 50.753 us  |        }
>> 395.678947: funcgraph_exit:       + 51.348 us  |      }
>> 395.678947: funcgraph_exit:       + 57.975 us  |    }
>> 395.678948: funcgraph_exit:       + 58.700 us  |  }
>> 395.678953: xprtrdma_mr_map:      task:64127@5 mr.id=104 nents=30 122880@0xc5e467fde9380000:0xc0011103 (TO_DEVICE)
>> 395.678953: xprtrdma_chunk_read:  task:64127@5 pos=148 122880@0xc5e467fde9380000:0xc0011103 (more)
>> --
>> Chuck Lever

--
Chuck Lever



_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2021-01-25  2:38 [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Lu Baolu
2021-01-25  2:38 ` [RFT PATCH 1/3] iommu: Move iotlb_sync_map out from __iommu_map Lu Baolu
2021-01-25  2:38 ` [RFT PATCH 2/3] iommu: Add iova and size as parameters in iotlb_sync_map Lu Baolu
2021-01-25  2:38 ` [RFT PATCH 3/3] iommu/vt-d: Add iotlb_sync_map callback Lu Baolu
2021-01-25 17:39 ` [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40 Chuck Lever
2021-01-25 19:31   ` Chuck Lever
2021-01-26  6:18     ` Lu Baolu
2021-01-26 15:52       ` Chuck Lever
2021-01-27  1:53         ` Lu Baolu
2021-01-27  2:58           ` Chuck Lever
2021-01-26 16:05     ` Robin Murphy
2021-01-26 17:32       ` Chuck Lever
