* [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
` (7 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
It turns out that we've always relied on some subtle ordering guarantees
when inserting commands into the SMMUv3 command queue. With the recent
changes to elide locking when possible, these guarantees become more
subtle and even more important.
Add a comment documenting the barrier semantics of command insertion so
that we don't have to derive the behaviour from scratch each time it
comes up on the list.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b36a99971401..3402b1bc8e94 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1286,6 +1286,22 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
}
}
+/*
+ * This is the actual insertion function, and provides the following
+ * ordering guarantees to callers:
+ *
+ * - There is a dma_wmb() before publishing any commands to the queue.
+ * This can be relied upon to order prior writes to data structures
+ * in memory (such as a CD or an STE) before the command.
+ *
+ * - On completion of a CMD_SYNC, there is a control dependency.
+ * This can be relied upon to order subsequent writes to memory (e.g.
+ * freeing an IOVA) after completion of the CMD_SYNC.
+ *
+ * - Command insertion is totally ordered, so if two CPUs each race to
+ * insert their own list of commands then all of the commands from one
+ * CPU will appear before any of the commands from the other CPU.
+ */
static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u64 *cmds, int n, bool sync)
{
--
2.11.0
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
2019-08-21 15:17 ` [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:36 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag Will Deacon
` (6 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Robin Murphy, stable, Will Deacon
Detecting the ATS capability of the SMMU at probe time introduces a
spinlock into the ->unmap() fast path, even when ATS is not actually
in use. Furthermore, the ATC invalidation that exists is broken, as it
occurs before invalidation of the main SMMU TLB which leaves a window
where the ATC can be repopulated with stale entries.
Given that ATS is both a new feature and a specialist sport, disable it
for now whilst we fix it properly in subsequent patches. Since PRI
requires ATS, disable that too.
Cc: <stable@vger.kernel.org>
Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 3402b1bc8e94..7a368059cd7d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3295,11 +3295,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Boolean feature flags */
+#if 0 /* ATS invalidation is slow and broken */
if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
+#endif
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
--
2.11.0
* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
@ 2019-08-21 15:36 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 15:36 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker, stable
On 21/08/2019 16:17, Will Deacon wrote:
> Detecting the ATS capability of the SMMU at probe time introduces a
> spinlock into the ->unmap() fast path, even when ATS is not actually
> in use. Furthermore, the ATC invalidation that exists is broken, as it
> occurs before invalidation of the main SMMU TLB which leaves a window
> where the ATC can be repopulated with stale entries.
>
> Given that ATS is both a new feature and a specialist sport, disable it
> for now whilst we fix it properly in subsequent patches. Since PRI
> requires ATS, disable that too.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
Acked-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 3402b1bc8e94..7a368059cd7d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3295,11 +3295,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
> }
>
> /* Boolean feature flags */
> +#if 0 /* ATS invalidation is slow and broken */
> if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
> smmu->features |= ARM_SMMU_FEAT_PRI;
>
> if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
> smmu->features |= ARM_SMMU_FEAT_ATS;
> +#endif
>
> if (reg & IDR0_SEV)
> smmu->features |= ARM_SMMU_FEAT_SEV;
>
* [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
2019-08-21 15:17 ` [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion Will Deacon
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations Will Deacon
` (5 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
There's really no need for this to be a bitfield, particularly as we
don't have bitwise addressing on arm64.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 7a368059cd7d..2be11a11bb8b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -637,7 +637,7 @@ struct arm_smmu_master {
struct list_head domain_head;
u32 *sids;
unsigned int num_sids;
- bool ats_enabled :1;
+ bool ats_enabled;
};
/* SMMU private data for an IOMMU domain */
--
2.11.0
* [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (2 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
` (4 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
Calling arm_smmu_tlb_inv_range() with a size of zero, perhaps due to
an empty 'iommu_iotlb_gather' structure, should be a NOP. Elide the
CMD_SYNC when there is no invalidation to be performed.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 2be11a11bb8b..b7b3b0ff8ed6 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1977,6 +1977,9 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
},
};
+ if (!size)
+ return;
+
if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
cmd.opcode = CMDQ_OP_TLBI_NH_VA;
cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid;
--
2.11.0
* [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (3 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:50 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
` (3 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
To prevent any potential issues arising from speculative Address
Translation Requests from an ATS-enabled PCIe endpoint, rework our ATS
enabling/disabling logic so that we enable ATS at the SMMU before we
enable it at the endpoint, and disable things in the opposite order.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b7b3b0ff8ed6..d7c65dfe42dc 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2286,44 +2286,52 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
}
}
-static int arm_smmu_enable_ats(struct arm_smmu_master *master)
+static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
{
- int ret;
- size_t stu;
struct pci_dev *pdev;
struct arm_smmu_device *smmu = master->smmu;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
if (!(smmu->features & ARM_SMMU_FEAT_ATS) || !dev_is_pci(master->dev) ||
!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS) || pci_ats_disabled())
- return -ENXIO;
+ return false;
pdev = to_pci_dev(master->dev);
- if (pdev->untrusted)
- return -EPERM;
+ return !pdev->untrusted && pdev->ats_cap;
+}
- /* Smallest Translation Unit: log2 of the smallest supported granule */
- stu = __ffs(smmu->pgsize_bitmap);
+static void arm_smmu_enable_ats(struct arm_smmu_master *master)
+{
+ size_t stu;
+ struct pci_dev *pdev;
+ struct arm_smmu_device *smmu = master->smmu;
- ret = pci_enable_ats(pdev, stu);
- if (ret)
- return ret;
+ /* Don't enable ATS at the endpoint if it's not enabled in the STE */
+ if (!master->ats_enabled)
+ return;
- master->ats_enabled = true;
- return 0;
+ /* Smallest Translation Unit: log2 of the smallest supported granule */
+ stu = __ffs(smmu->pgsize_bitmap);
+ pdev = to_pci_dev(master->dev);
+ if (pci_enable_ats(pdev, stu))
+ dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
}
static void arm_smmu_disable_ats(struct arm_smmu_master *master)
{
struct arm_smmu_cmdq_ent cmd;
- if (!master->ats_enabled || !dev_is_pci(master->dev))
+ if (!master->ats_enabled)
return;
+ pci_disable_ats(to_pci_dev(master->dev));
+ /*
+ * Ensure ATS is disabled at the endpoint before we issue the
+ * ATC invalidation via the SMMU.
+ */
+ wmb();
arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
arm_smmu_atc_inv_master(master, &cmd);
- pci_disable_ats(to_pci_dev(master->dev));
- master->ats_enabled = false;
}
static void arm_smmu_detach_dev(struct arm_smmu_master *master)
@@ -2338,10 +2346,10 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
list_del(&master->domain_head);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+ arm_smmu_disable_ats(master);
master->domain = NULL;
+ master->ats_enabled = false;
arm_smmu_install_ste_for_dev(master);
-
- arm_smmu_disable_ats(master);
}
static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
@@ -2386,12 +2394,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
- arm_smmu_enable_ats(master);
+ master->ats_enabled = arm_smmu_ats_supported(master);
if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
arm_smmu_install_ste_for_dev(master);
+ arm_smmu_enable_ats(master);
out_unlock:
mutex_unlock(&smmu_domain->init_mutex);
return ret;
--
2.11.0
* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
@ 2019-08-21 15:50 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 15:50 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> To prevent any potential issues arising from speculative Address
> Translation Requests from an ATS-enabled PCIe endpoint, rework our ATS
> enabling/disabling logic so that we enable ATS at the SMMU before we
> enable it at the endpoint, and disable things in the opposite order.
This time it looks right :)
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++------------------
> 1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index b7b3b0ff8ed6..d7c65dfe42dc 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2286,44 +2286,52 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
> }
> }
>
> -static int arm_smmu_enable_ats(struct arm_smmu_master *master)
> +static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
> {
> - int ret;
> - size_t stu;
> struct pci_dev *pdev;
> struct arm_smmu_device *smmu = master->smmu;
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
>
> if (!(smmu->features & ARM_SMMU_FEAT_ATS) || !dev_is_pci(master->dev) ||
> !(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS) || pci_ats_disabled())
> - return -ENXIO;
> + return false;
>
> pdev = to_pci_dev(master->dev);
> - if (pdev->untrusted)
> - return -EPERM;
> + return !pdev->untrusted && pdev->ats_cap;
> +}
>
> - /* Smallest Translation Unit: log2 of the smallest supported granule */
> - stu = __ffs(smmu->pgsize_bitmap);
> +static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> +{
> + size_t stu;
> + struct pci_dev *pdev;
> + struct arm_smmu_device *smmu = master->smmu;
>
> - ret = pci_enable_ats(pdev, stu);
> - if (ret)
> - return ret;
> + /* Don't enable ATS at the endpoint if it's not enabled in the STE */
> + if (!master->ats_enabled)
> + return;
>
> - master->ats_enabled = true;
> - return 0;
> + /* Smallest Translation Unit: log2 of the smallest supported granule */
> + stu = __ffs(smmu->pgsize_bitmap);
> + pdev = to_pci_dev(master->dev);
> + if (pci_enable_ats(pdev, stu))
> + dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
> }
>
> static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> {
> struct arm_smmu_cmdq_ent cmd;
>
> - if (!master->ats_enabled || !dev_is_pci(master->dev))
> + if (!master->ats_enabled)
> return;
>
> + pci_disable_ats(to_pci_dev(master->dev));
> + /*
> + * Ensure ATS is disabled at the endpoint before we issue the
> + * ATC invalidation via the SMMU.
> + */
> + wmb();
> arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
> arm_smmu_atc_inv_master(master, &cmd);
> - pci_disable_ats(to_pci_dev(master->dev));
> - master->ats_enabled = false;
> }
>
> static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> @@ -2338,10 +2346,10 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> list_del(&master->domain_head);
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> + arm_smmu_disable_ats(master);
> master->domain = NULL;
> + master->ats_enabled = false;
> arm_smmu_install_ste_for_dev(master);
> -
> - arm_smmu_disable_ats(master);
> }
>
> static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> @@ -2386,12 +2394,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> - arm_smmu_enable_ats(master);
> + master->ats_enabled = arm_smmu_ats_supported(master);
>
> if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
> arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
>
> arm_smmu_install_ste_for_dev(master);
> + arm_smmu_enable_ats(master);
> out_unlock:
> mutex_unlock(&smmu_domain->init_mutex);
> return ret;
>
* [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (4 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 16:25 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
` (2 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
When invalidating the ATC for a PCIe endpoint using ATS, we must take
care to complete invalidation of the main SMMU TLBs beforehand, otherwise
the device could immediately repopulate its ATC with stale translations.
Hooking the ATC invalidation into ->unmap() as we currently do does the
exact opposite: it ensures that the ATC is invalidated *before* the
main TLBs, which is bogus.
Move ATC invalidation into the actual (leaf) invalidation routines so
that it is always called after completing main TLB invalidation.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d7c65dfe42dc..ca504a60312d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1961,6 +1961,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
*/
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
}
static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
@@ -1969,7 +1970,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
{
u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
struct arm_smmu_device *smmu = smmu_domain->smmu;
- unsigned long end = iova + size;
+ unsigned long start = iova, end = iova + size;
int i = 0;
struct arm_smmu_cmdq_ent cmd = {
.tlbi = {
@@ -2001,6 +2002,12 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
}
arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
+
+ /*
+ * Unfortunately, this can't be leaf-only since we may have
+ * zapped an entire table.
+ */
+ arm_smmu_atc_inv_domain(smmu_domain, 0, start, size);
}
static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
@@ -2420,18 +2427,13 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
size_t size, struct iommu_iotlb_gather *gather)
{
- int ret;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
if (!ops)
return 0;
- ret = ops->unmap(ops, iova, size, gather);
- if (ret && arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size))
- return 0;
-
- return ret;
+ return ops->unmap(ops, iova, size, gather);
}
static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
--
2.11.0
* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
@ 2019-08-21 16:25 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 16:25 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> When invalidating the ATC for a PCIe endpoint using ATS, we must take
> care to complete invalidation of the main SMMU TLBs beforehand, otherwise
> the device could immediately repopulate its ATC with stale translations.
>
> Hooking the ATC invalidation into ->unmap() as we currently do does the
> exact opposite: it ensures that the ATC is invalidated *before* the
> main TLBs, which is bogus.
>
> Move ATC invalidation into the actual (leaf) invalidation routines so
> that it is always called after completing main TLB invalidation.
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index d7c65dfe42dc..ca504a60312d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -1961,6 +1961,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
> */
> arm_smmu_cmdq_issue_cmd(smmu, &cmd);
> arm_smmu_cmdq_issue_sync(smmu);
> + arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> }
>
> static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> @@ -1969,7 +1970,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> {
> u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
> struct arm_smmu_device *smmu = smmu_domain->smmu;
> - unsigned long end = iova + size;
> + unsigned long start = iova, end = iova + size;
> int i = 0;
> struct arm_smmu_cmdq_ent cmd = {
> .tlbi = {
> @@ -2001,6 +2002,12 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> }
>
> arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
> +
> + /*
> + * Unfortunately, this can't be leaf-only since we may have
> + * zapped an entire table.
> + */
> + arm_smmu_atc_inv_domain(smmu_domain, 0, start, size);
> }
>
> static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
> @@ -2420,18 +2427,13 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
> static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
> size_t size, struct iommu_iotlb_gather *gather)
> {
> - int ret;
> struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
>
> if (!ops)
> return 0;
>
> - ret = ops->unmap(ops, iova, size, gather);
> - if (ret && arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size))
> - return 0;
> -
> - return ret;
> + return ops->unmap(ops, iova, size, gather);
> }
>
> static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
>
* [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (5 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-22 12:36 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI" Will Deacon
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
When ATS is not in use, we can avoid taking the 'devices_lock' for the
domain on the invalidation path by simply caching the number of ATS
masters currently attached. The fiddly part is handling a concurrent
->attach() of an ATS-enabled master to a domain that is being
invalidated, but we can handle this using an 'smp_mb()' to ensure that
our check of the count is ordered after completion of our prior TLB
invalidation.
This also makes our ->attach() and ->detach() flows symmetric wrt ATS
interactions.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ca504a60312d..0e43529d55fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -654,6 +654,7 @@ struct arm_smmu_domain {
struct io_pgtable_ops *pgtbl_ops;
bool non_strict;
+ atomic_t nr_ats_masters;
enum arm_smmu_domain_stage stage;
union {
@@ -1926,6 +1927,23 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
return 0;
+ /*
+ * Ensure that we've completed prior invalidation of the main TLBs
+ * before we read 'nr_ats_masters' in case of a concurrent call to
+ * arm_smmu_enable_ats():
+ *
+ * // unmap() // arm_smmu_enable_ats()
+ * TLBI+SYNC atomic_inc(&nr_ats_masters);
+ * smp_mb(); [...]
+ * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
+ *
+ * Ensures that we always see the incremented 'nr_ats_masters' count if
+ * ATS was enabled at the PCI device before completion of the TLBI.
+ */
+ smp_mb();
+ if (!atomic_read(&smmu_domain->nr_ats_masters))
+ return 0;
+
arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
@@ -2312,6 +2330,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
size_t stu;
struct pci_dev *pdev;
struct arm_smmu_device *smmu = master->smmu;
+ struct arm_smmu_domain *smmu_domain = master->domain;
/* Don't enable ATS at the endpoint if it's not enabled in the STE */
if (!master->ats_enabled)
@@ -2320,6 +2339,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
/* Smallest Translation Unit: log2 of the smallest supported granule */
stu = __ffs(smmu->pgsize_bitmap);
pdev = to_pci_dev(master->dev);
+
+ atomic_inc(&smmu_domain->nr_ats_masters);
+ arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
if (pci_enable_ats(pdev, stu))
dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
}
@@ -2327,6 +2349,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
static void arm_smmu_disable_ats(struct arm_smmu_master *master)
{
struct arm_smmu_cmdq_ent cmd;
+ struct arm_smmu_domain *smmu_domain = master->domain;
if (!master->ats_enabled)
return;
@@ -2339,6 +2362,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master)
wmb();
arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
arm_smmu_atc_inv_master(master, &cmd);
+ atomic_dec(&smmu_domain->nr_ats_masters);
}
static void arm_smmu_detach_dev(struct arm_smmu_master *master)
@@ -2349,11 +2373,12 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
if (!smmu_domain)
return;
+ arm_smmu_disable_ats(master);
+
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
list_del(&master->domain_head);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
- arm_smmu_disable_ats(master);
master->domain = NULL;
master->ats_enabled = false;
arm_smmu_install_ste_for_dev(master);
@@ -2396,10 +2421,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
master->domain = smmu_domain;
- spin_lock_irqsave(&smmu_domain->devices_lock, flags);
- list_add(&master->domain_head, &smmu_domain->devices);
- spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
-
if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
master->ats_enabled = arm_smmu_ats_supported(master);
@@ -2407,7 +2428,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
arm_smmu_install_ste_for_dev(master);
+
+ spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+ list_add(&master->domain_head, &smmu_domain->devices);
+ spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
arm_smmu_enable_ats(master);
+
out_unlock:
mutex_unlock(&smmu_domain->init_mutex);
return ret;
--
2.11.0
* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
@ 2019-08-22 12:36 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-22 12:36 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> When ATS is not in use, we can avoid taking the 'devices_lock' for the
> domain on the invalidation path by simply caching the number of ATS
> masters currently attached. The fiddly part is handling a concurrent
> ->attach() of an ATS-enabled master to a domain that is being
> invalidated, but we can handle this using an 'smp_mb()' to ensure that
> our check of the count is ordered after completion of our prior TLB
> invalidation.
>
> This also makes our ->attach() and ->detach() flows symmetric wrt ATS
> interactions.
I don't have the bandwidth just now for a deep dive into the ordering
subtleties, but on the surface I think this looks sound - provided that
we don't forget and start calling arm_smmu_atc_inv_master() directly
from anywhere other than detach.
Acked-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 37 ++++++++++++++++++++++++++++++++-----
> 1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index ca504a60312d..0e43529d55fe 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -654,6 +654,7 @@ struct arm_smmu_domain {
>
> struct io_pgtable_ops *pgtbl_ops;
> bool non_strict;
> + atomic_t nr_ats_masters;
>
> enum arm_smmu_domain_stage stage;
> union {
> @@ -1926,6 +1927,23 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
> if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> return 0;
>
> + /*
> + * Ensure that we've completed prior invalidation of the main TLBs
> + * before we read 'nr_ats_masters' in case of a concurrent call to
> + * arm_smmu_enable_ats():
> + *
> + * // unmap() // arm_smmu_enable_ats()
> + * TLBI+SYNC atomic_inc(&nr_ats_masters);
> + * smp_mb(); [...]
> + * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
> + *
> + * Ensures that we always see the incremented 'nr_ats_masters' count if
> + * ATS was enabled at the PCI device before completion of the TLBI.
> + */
> + smp_mb();
> + if (!atomic_read(&smmu_domain->nr_ats_masters))
> + return 0;
> +
> arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
>
> spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> @@ -2312,6 +2330,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> size_t stu;
> struct pci_dev *pdev;
> struct arm_smmu_device *smmu = master->smmu;
> + struct arm_smmu_domain *smmu_domain = master->domain;
>
> /* Don't enable ATS at the endpoint if it's not enabled in the STE */
> if (!master->ats_enabled)
> @@ -2320,6 +2339,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> /* Smallest Translation Unit: log2 of the smallest supported granule */
> stu = __ffs(smmu->pgsize_bitmap);
> pdev = to_pci_dev(master->dev);
> +
> + atomic_inc(&smmu_domain->nr_ats_masters);
> + arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> if (pci_enable_ats(pdev, stu))
> dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
> }
> @@ -2327,6 +2349,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> {
> struct arm_smmu_cmdq_ent cmd;
> + struct arm_smmu_domain *smmu_domain = master->domain;
>
> if (!master->ats_enabled)
> return;
> @@ -2339,6 +2362,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> wmb();
> arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
> arm_smmu_atc_inv_master(master, &cmd);
> + atomic_dec(&smmu_domain->nr_ats_masters);
> }
>
> static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> @@ -2349,11 +2373,12 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> if (!smmu_domain)
> return;
>
> + arm_smmu_disable_ats(master);
> +
> spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> list_del(&master->domain_head);
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> - arm_smmu_disable_ats(master);
> master->domain = NULL;
> master->ats_enabled = false;
> arm_smmu_install_ste_for_dev(master);
> @@ -2396,10 +2421,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>
> master->domain = smmu_domain;
>
> - spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> - list_add(&master->domain_head, &smmu_domain->devices);
> - spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> -
> if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> master->ats_enabled = arm_smmu_ats_supported(master);
>
> @@ -2407,7 +2428,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
>
> arm_smmu_install_ste_for_dev(master);
> +
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_add(&master->domain_head, &smmu_domain->devices);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> arm_smmu_enable_ats(master);
> +
> out_unlock:
> mutex_unlock(&smmu_domain->init_mutex);
> return ret;
>
* [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (6 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
This reverts commit dc65dc6eaf3eb762f202bb3493492d372b662b3d.
Now that ATC invalidation is performed in the correct places and without
incurring a locking overhead for non-ATS systems, we can re-enable the
corresponding SMMU feature detection.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 0e43529d55fe..b8049ea2e455 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3336,13 +3336,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Boolean feature flags */
-#if 0 /* ATS invalidation is slow and broken */
if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
-#endif
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
--
2.11.0
* arm-smmu-v3 high cpu usage for NVMe
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (7 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI" Will Deacon
@ 2020-01-02 17:44 ` John Garry
2020-03-18 20:53 ` Will Deacon
8 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-01-02 17:44 UTC (permalink / raw)
To: Will Deacon, iommu, Robin Murphy
Cc: Jean-Philippe Brucker, Ming Lei, Marc Zyngier
Hi Will, Robin,
While analyzing an arm64 issue in interrupt handling for NVMe [0], we
have noticed a worryingly high CPU utilization in the SMMU driver.
The background is that we may get a CPU lockup for high-throughput NVMe
testing, and we noticed that disabling the SMMU during testing avoids
the issue. However, this lockup is a cross-architecture issue and there
are attempts to address it, like [1]. To me, disabling the SMMU is just
avoiding that specific issue.
Anyway, we should still consider this high CPU loading:
PerfTop: 1694 irqs/sec kernel:97.3% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, CPU: 0)
--------------------------------------------------------------------------------------------------------------------------
50.84% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
19.51% [kernel] [k] _raw_spin_unlock_irqrestore
5.14% [kernel] [k] __slab_free
2.37% [kernel] [k] bio_release_pages.part.42
2.20% [kernel] [k] fput_many
1.92% [kernel] [k] aio_complete_rw
1.85% [kernel] [k] __arm_lpae_unmap
1.71% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
1.11% [kernel] [k] sbitmap_queue_clear
1.05% [kernel] [k] blk_mq_free_request
0.97% [kernel] [k] nvme_irq
0.71% [kernel] [k] blk_account_io_done
0.66% [kernel] [k] kmem_cache_free
0.66% [kernel] [k] blk_mq_complete_request
This is for a CPU servicing the NVMe interrupt and doing the DMA unmap.
The DMA unmap is done in threaded interrupt context.
And for the overall system, we have:
PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--------------------------------------------------------------------------------------------------------------------------
27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
11.71% [kernel] [k] _raw_spin_unlock_irqrestore
6.35% [kernel] [k] _raw_spin_unlock_irq
2.65% [kernel] [k] get_user_pages_fast
2.03% [kernel] [k] __slab_free
1.55% [kernel] [k] tick_nohz_idle_exit
1.47% [kernel] [k] arm_lpae_map
1.39% [kernel] [k] __fget
1.14% [kernel] [k] __lock_text_start
1.09% [kernel] [k] _raw_spin_lock
1.08% [kernel] [k] bio_release_pages.part.42
1.03% [kernel] [k] __sbitmap_get_word
0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
0.91% [kernel] [k] fput_many
0.88% [kernel] [k] __arm_lpae_map
One thing to note is that we still spend an appreciable amount of time
in arm_smmu_atc_inv_domain(), which is disappointing when considering it
should effectively be a noop.
As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing
our batch size is 1, so we're not seeing the real benefit of the
batching. I can't help but think that we could improve this code to try
to combine CMD_SYNCs for small batches.
Anyway, let me know your thoughts or any questions. I'll have a look,
if I get a chance, for other possible bottlenecks.
[0] https://lore.kernel.org/lkml/e815b5451ea86e99d42045f7067f455a@www.loen.fr/
[1] https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
Cheers,
John
On 21/08/2019 16:17, Will Deacon wrote:
> Hi again,
>
> This is version two of the patches I posted yesterday:
>
> v1: https://lkml.kernel.org/r/20190820154549.17018-1-will@kernel.org
>
> Changes since then include:
>
> * Fix 'ats_enabled' checking when enabling ATS
> * Remove redundant 'dev_is_pci()' calls
> * Remove bool bitfield
> * Add patch temporarily disabling ATS detection for -stable
> * Issue ATC invalidation even when non-leaf
> * Elide invalidation/SYNC for zero-sized address ranges
> * Shuffle the patches round a bit
>
> Thanks,
>
> Will
>
> Cc: Zhen Lei <thunder.leizhen@huawei.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
>
> --->8
>
> Will Deacon (8):
> iommu/arm-smmu-v3: Document ordering guarantees of command insertion
> iommu/arm-smmu-v3: Disable detection of ATS and PRI
> iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
> iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
> iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
> iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
> iommu/arm-smmu-v3: Avoid locking on invalidation path when not using
> ATS
> Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"
>
> drivers/iommu/arm-smmu-v3.c | 117 ++++++++++++++++++++++++++++++++------------
> 1 file changed, 87 insertions(+), 30 deletions(-)
>
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
@ 2020-03-18 20:53 ` Will Deacon
2020-03-19 12:54 ` John Garry
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
0 siblings, 2 replies; 32+ messages in thread
From: Will Deacon @ 2020-03-18 20:53 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
Hi John,
On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> And for the overall system, we have:
>
> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> 0/40116 [4000Hz cycles], (all, 96 CPUs)
> --------------------------------------------------------------------------------------------------------------------------
>
> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> 6.35% [kernel] [k] _raw_spin_unlock_irq
> 2.65% [kernel] [k] get_user_pages_fast
> 2.03% [kernel] [k] __slab_free
> 1.55% [kernel] [k] tick_nohz_idle_exit
> 1.47% [kernel] [k] arm_lpae_map
> 1.39% [kernel] [k] __fget
> 1.14% [kernel] [k] __lock_text_start
> 1.09% [kernel] [k] _raw_spin_lock
> 1.08% [kernel] [k] bio_release_pages.part.42
> 1.03% [kernel] [k] __sbitmap_get_word
> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> 0.91% [kernel] [k] fput_many
> 0.88% [kernel] [k] __arm_lpae_map
>
> One thing to note is that we still spend an appreciable amount of time in
> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> effectively be a noop.
>
> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> batch size is 1, so we're not seeing the real benefit of the batching. I
> can't help but think that we could improve this code to try to combine CMD
> SYNCs for small batches.
>
> Anyway, let me know your thoughts or any questions. I'll have a look if a
> get a chance for other possible bottlenecks.
Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.
Will
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-18 20:53 ` Will Deacon
@ 2020-03-19 12:54 ` John Garry
2020-03-19 18:43 ` Jean-Philippe Brucker
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-19 12:54 UTC (permalink / raw)
To: Will Deacon
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
Hi Will,
>
> On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
>> And for the overall system, we have:
>>
>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>> --------------------------------------------------------------------------------------------------------------------------
>>
>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>> 2.65% [kernel] [k] get_user_pages_fast
>> 2.03% [kernel] [k] __slab_free
>> 1.55% [kernel] [k] tick_nohz_idle_exit
>> 1.47% [kernel] [k] arm_lpae_map
>> 1.39% [kernel] [k] __fget
>> 1.14% [kernel] [k] __lock_text_start
>> 1.09% [kernel] [k] _raw_spin_lock
>> 1.08% [kernel] [k] bio_release_pages.part.42
>> 1.03% [kernel] [k] __sbitmap_get_word
>> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
>> 0.91% [kernel] [k] fput_many
>> 0.88% [kernel] [k] __arm_lpae_map
>>
>> One thing to note is that we still spend an appreciable amount of time in
>> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
>> effectively be a noop.
>>
>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
>> batch size is 1, so we're not seeing the real benefit of the batching. I
>> can't help but think that we could improve this code to try to combine CMD
>> SYNCs for small batches.
>>
>> Anyway, let me know your thoughts or any questions. I'll have a look if a
>> get a chance for other possible bottlenecks.
>
> Did you ever get any more information on this? I don't have any SMMUv3
> hardware any more, so I can't really dig into this myself.
I'm only getting back to look at this now, as SMMU performance is a bit
of a hot topic again for us.
So one thing we are doing which looks to help performance is this series
from Marc:
https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
So that is just spreading the per-CPU load for NVMe interrupt handling
(where the DMA unmapping is happening), so I'd say just side-stepping
any SMMU issue really.
Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll
look at this again now.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-19 12:54 ` John Garry
@ 2020-03-19 18:43 ` Jean-Philippe Brucker
2020-03-20 10:41 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-19 18:43 UTC (permalink / raw)
To: John Garry; +Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> Hi Will,
>
> >
> > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > And for the overall system, we have:
> > >
> > > PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> > > 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > --------------------------------------------------------------------------------------------------------------------------
> > >
> > > 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> > > 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> > > 6.35% [kernel] [k] _raw_spin_unlock_irq
> > > 2.65% [kernel] [k] get_user_pages_fast
> > > 2.03% [kernel] [k] __slab_free
> > > 1.55% [kernel] [k] tick_nohz_idle_exit
> > > 1.47% [kernel] [k] arm_lpae_map
> > > 1.39% [kernel] [k] __fget
> > > 1.14% [kernel] [k] __lock_text_start
> > > 1.09% [kernel] [k] _raw_spin_lock
> > > 1.08% [kernel] [k] bio_release_pages.part.42
> > > 1.03% [kernel] [k] __sbitmap_get_word
> > > 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> > > 0.91% [kernel] [k] fput_many
> > > 0.88% [kernel] [k] __arm_lpae_map
> > >
> > > One thing to note is that we still spend an appreciable amount of time in
> > > arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> > > effectively be a noop.
> > >
> > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> > > batch size is 1, so we're not seeing the real benefit of the batching. I
> > > can't help but think that we could improve this code to try to combine CMD
> > > SYNCs for small batches.
> > >
> > > Anyway, let me know your thoughts or any questions. I'll have a look if a
> > > get a chance for other possible bottlenecks.
> >
> > Did you ever get any more information on this? I don't have any SMMUv3
> > hardware any more, so I can't really dig into this myself.
>
> I'm only getting back to look at this now, as SMMU performance is a bit of a
> hot topic again for us.
>
> So one thing we are doing which looks to help performance is this series
> from Marc:
>
> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
>
> So that is just spreading the per-CPU load for NVMe interrupt handling
> (where the DMA unmapping is happening), so I'd say just side-stepping any
> SMMU issue really.
>
> Going back to the SMMU, I wanted to run epbf and perf annotate to help
> profile this, but was having no luck getting them to work properly. I'll
> look at this again now.
Could you also try the upcoming ATS changes currently in Will's tree?
They won't improve your numbers, but it'd be good to check that they
don't make things worse.
I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.
- 6.07% arm_smmu_iotlb_sync
- 5.74% arm_smmu_tlb_inv_range
5.09% arm_smmu_cmdq_issue_cmdlist
0.28% __pi_memset
0.08% __pi_memcpy
0.08% arm_smmu_atc_inv_domain.constprop.37
0.07% arm_smmu_cmdq_build_cmd
0.01% arm_smmu_cmdq_batch_add
0.31% __pi_memset
So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.
Thanks,
Jean
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-19 18:43 ` Jean-Philippe Brucker
@ 2020-03-20 10:41 ` John Garry
2020-03-20 11:18 ` Jean-Philippe Brucker
2020-05-22 14:52 ` John Garry
0 siblings, 2 replies; 32+ messages in thread
From: John Garry @ 2020-03-20 10:41 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
>> Hi Will,
>>
>>>
>>> On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
>>>> And for the overall system, we have:
>>>>
>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>>>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>>>> 2.65% [kernel] [k] get_user_pages_fast
>>>> 2.03% [kernel] [k] __slab_free
>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
>>>> 1.47% [kernel] [k] arm_lpae_map
>>>> 1.39% [kernel] [k] __fget
>>>> 1.14% [kernel] [k] __lock_text_start
>>>> 1.09% [kernel] [k] _raw_spin_lock
>>>> 1.08% [kernel] [k] bio_release_pages.part.42
>>>> 1.03% [kernel] [k] __sbitmap_get_word
>>>> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
>>>> 0.91% [kernel] [k] fput_many
>>>> 0.88% [kernel] [k] __arm_lpae_map
>>>>
>>>> One thing to note is that we still spend an appreciable amount of time in
>>>> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
>>>> effectively be a noop.
>>>>
>>>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
>>>> batch size is 1, so we're not seeing the real benefit of the batching. I
>>>> can't help but think that we could improve this code to try to combine CMD
>>>> SYNCs for small batches.
>>>>
>>>> Anyway, let me know your thoughts or any questions. I'll have a look if a
>>>> get a chance for other possible bottlenecks.
>>>
>>> Did you ever get any more information on this? I don't have any SMMUv3
>>> hardware any more, so I can't really dig into this myself.
>>
>> I'm only getting back to look at this now, as SMMU performance is a bit of a
>> hot topic again for us.
>>
>> So one thing we are doing which looks to help performance is this series
>> from Marc:
>>
>> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
>>
>> So that is just spreading the per-CPU load for NVMe interrupt handling
>> (where the DMA unmapping is happening), so I'd say just side-stepping any
>> SMMU issue really.
>>
>> Going back to the SMMU, I wanted to run epbf and perf annotate to help
>> profile this, but was having no luck getting them to work properly. I'll
>> look at this again now.
>
> Could you also try with the upcoming ATS change currently in Will's tree?
> They won't improve your numbers but it'd be good to check that they don't
> make things worse.
I can do when I get a chance.
>
> I've run a bunch of netperf instances on multiple cores and collecting
> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> consistently.
>
> - 6.07% arm_smmu_iotlb_sync
> - 5.74% arm_smmu_tlb_inv_range
> 5.09% arm_smmu_cmdq_issue_cmdlist
> 0.28% __pi_memset
> 0.08% __pi_memcpy
> 0.08% arm_smmu_atc_inv_domain.constprop.37
> 0.07% arm_smmu_cmdq_build_cmd
> 0.01% arm_smmu_cmdq_batch_add
> 0.31% __pi_memset
>
> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
> when ATS is not used. According to the annotations, the load from the
> atomic_read(), that checks whether the domain uses ATS, is 77% of the
> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
> there is much room for optimization there.
Well, I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value.
But that would be an optimisation for ATS also, and there were no ATS
devices at the time (to verify performance).
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 10:41 ` John Garry
@ 2020-03-20 11:18 ` Jean-Philippe Brucker
2020-03-20 16:20 ` John Garry
2020-05-22 14:52 ` John Garry
1 sibling, 1 reply; 32+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-20 11:18 UTC (permalink / raw)
To: John Garry; +Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On Fri, Mar 20, 2020 at 10:41:44AM +0000, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> > > Hi Will,
> > >
> > > >
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > >
> > > > > PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> > > > > 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > > > --------------------------------------------------------------------------------------------------------------------------
> > > > >
> > > > > 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> > > > > 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> > > > > 6.35% [kernel] [k] _raw_spin_unlock_irq
> > > > > 2.65% [kernel] [k] get_user_pages_fast
> > > > > 2.03% [kernel] [k] __slab_free
> > > > > 1.55% [kernel] [k] tick_nohz_idle_exit
> > > > > 1.47% [kernel] [k] arm_lpae_map
> > > > > 1.39% [kernel] [k] __fget
> > > > > 1.14% [kernel] [k] __lock_text_start
> > > > > 1.09% [kernel] [k] _raw_spin_lock
> > > > > 1.08% [kernel] [k] bio_release_pages.part.42
> > > > > 1.03% [kernel] [k] __sbitmap_get_word
> > > > > 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> > > > > 0.91% [kernel] [k] fput_many
> > > > > 0.88% [kernel] [k] __arm_lpae_map
> > > > >
> > > > > One thing to note is that we still spend an appreciable amount of time in
> > > > > arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> > > > > effectively be a noop.
> > > > >
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> > > > > batch size is 1, so we're not seeing the real benefit of the batching. I
> > > > > can't help but think that we could improve this code to try to combine CMD
> > > > > SYNCs for small batches.
> > > > >
> > > > > Anyway, let me know your thoughts or any questions. I'll have a look if a
> > > > > get a chance for other possible bottlenecks.
> > > >
> > > > Did you ever get any more information on this? I don't have any SMMUv3
> > > > hardware any more, so I can't really dig into this myself.
> > >
> > > I'm only getting back to look at this now, as SMMU performance is a bit of a
> > > hot topic again for us.
> > >
> > > So one thing we are doing which looks to help performance is this series
> > > from Marc:
> > >
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > >
> > > So that is just spreading the per-CPU load for NVMe interrupt handling
> > > (where the DMA unmapping is happening), so I'd say just side-stepping any
> > > SMMU issue really.
> > >
> > > Going back to the SMMU, I wanted to run epbf and perf annotate to help
> > > profile this, but was having no luck getting them to work properly. I'll
> > > look at this again now.
> >
> > Could you also try with the upcoming ATS change currently in Will's tree?
> > They won't improve your numbers but it'd be good to check that they don't
> > make things worse.
>
> I can do when I get a chance.
>
> >
> > I've run a bunch of netperf instances on multiple cores and collecting
> > SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> > consistently.
> >
> > - 6.07% arm_smmu_iotlb_sync
> > - 5.74% arm_smmu_tlb_inv_range
> > 5.09% arm_smmu_cmdq_issue_cmdlist
> > 0.28% __pi_memset
> > 0.08% __pi_memcpy
> > 0.08% arm_smmu_atc_inv_domain.constprop.37
> > 0.07% arm_smmu_cmdq_build_cmd
> > 0.01% arm_smmu_cmdq_batch_add
> > 0.31% __pi_memset
> >
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
> > when ATS is not used. According to the annotations, the load from the
> > atomic_read(), that checks whether the domain uses ATS, is 77% of the
> > samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
> > there is much room for optimization there.
>
> Well I did originally suggest using RCU protection to scan the list of
> devices, instead of reading an atomic and checking for non-zero value. But
> that would be an optimsation for ATS also, and there was no ATS devices at
> the time (to verify performance).
Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.
When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.
Thanks,
Jean
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 11:18 ` Jean-Philippe Brucker
@ 2020-03-20 16:20 ` John Garry
2020-03-20 16:33 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-20 16:20 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
>>
>>>
>>> I've run a bunch of netperf instances on multiple cores and collecting
>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>> consistently.
>>>
>>> - 6.07% arm_smmu_iotlb_sync
>>> - 5.74% arm_smmu_tlb_inv_range
>>> 5.09% arm_smmu_cmdq_issue_cmdlist
>>> 0.28% __pi_memset
>>> 0.08% __pi_memcpy
>>> 0.08% arm_smmu_atc_inv_domain.constprop.37
>>> 0.07% arm_smmu_cmdq_build_cmd
>>> 0.01% arm_smmu_cmdq_batch_add
>>> 0.31% __pi_memset
>>>
>>> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
>>> when ATS is not used. According to the annotations, the load from the
>>> atomic_read(), that checks whether the domain uses ATS, is 77% of the
>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
>>> there is much room for optimization there.
>>
>> Well I did originally suggest using RCU protection to scan the list of
>> devices, instead of reading an atomic and checking for non-zero value. But
>> that would be an optimsation for ATS also, and there was no ATS devices at
>> the time (to verify performance).
>
> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
> performance, but I agree that using RCU to scan the list should get better
> results when using ATS.
>
> When ATS isn't in use however, I suspect reading nr_ats_masters should be
> more efficient than taking the RCU lock + reading an "ats_devices" list
> (since the smmu_domain->devices list also serves context descriptor
> invalidation, even when ATS isn't in use). I'll run some tests however, to
> see if I can micro-optimize this case, but I don't expect noticeable
> improvements.
ok, cheers. I, too, would not expect a significant improvement there.
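As a sketch of the check being discussed, a per-domain atomic counter of ATS-enabled masters gives a single-load fast path for the invalidation code. The struct and function names below only mirror the driver's; this is an illustrative userspace model, not the actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch (not the driver code): the domain keeps a count of
 * masters with ATS enabled, and the invalidation path checks it with a
 * single atomic load before walking any device list. */
struct smmu_domain {
	atomic_int nr_ats_masters;
};

static bool domain_needs_atc_inv(struct smmu_domain *d)
{
	/* fast path: one load; no RCU lock, no list walk */
	return atomic_load_explicit(&d->nr_ats_masters,
				    memory_order_relaxed) != 0;
}

static void domain_ats_enable(struct smmu_domain *d)
{
	atomic_fetch_add_explicit(&d->nr_ats_masters, 1,
				  memory_order_relaxed);
}

static void domain_ats_disable(struct smmu_domain *d)
{
	atomic_fetch_sub_explicit(&d->nr_ats_masters, 1,
				  memory_order_relaxed);
}
```

The cost is that enable/disable bounce the counter's cacheline, which is why an RCU-protected list was floated as an alternative for the ATS-enabled case.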
JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane, if
not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
 9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
 1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
 1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
 1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
 1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw
But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:
      : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
      : if (sync) {
 0.00 : ffff80001071c948: ldr w0, [x29, #108]
      : int ret = 0;
 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
      : if (sync) {
 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990 <arm_smmu_cmdq_issue_cmdlist+0x420>
      : arch_local_irq_restore():
 0.00 : ffff80001071c954: msr daif, x21
      : arm_smmu_cmdq_issue_cmdlist():
      : }
      : }
      :
      : local_irq_restore(flags);
      : return ret;
      : }
99.51 : ffff80001071c958: adrp x0, ffff800011909000 <page_wait_table+0x14c0>
 0.00 : ffff80001071c95c: add x21, x0, #0x908
 0.02 : ffff80001071c960: ldr x2, [x29, #488]
 0.14 : ffff80001071c964: ldr x1, [x21]
 0.00 : ffff80001071c968: eor x1, x2, x1
 0.00 : ffff80001071c96c: mov w0, w24
But there may be a hint that we're spending a lot of time polling for
CMD_SYNC completion.
The files are available here:
https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt,
and .../report
Or maybe I'm just not using the tool properly ...
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 16:20 ` John Garry
@ 2020-03-20 16:33 ` Marc Zyngier
2020-03-23 9:03 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-20 16:33 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
Hi John,
On 2020-03-20 16:20, John Garry wrote:
>>>
>>>>
>>>> I've run a bunch of netperf instances on multiple cores and
>>>> collecting
>>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>>> consistently.
>>>>
>>>> - 6.07% arm_smmu_iotlb_sync
>>>> - 5.74% arm_smmu_tlb_inv_range
>>>> 5.09% arm_smmu_cmdq_issue_cmdlist
>>>> 0.28% __pi_memset
>>>> 0.08% __pi_memcpy
>>>> 0.08% arm_smmu_atc_inv_domain.constprop.37
>>>> 0.07% arm_smmu_cmdq_build_cmd
>>>> 0.01% arm_smmu_cmdq_batch_add
>>>> 0.31% __pi_memset
>>>>
>>>> So arm_smmu_atc_inv_domain() takes about 1.4% of
>>>> arm_smmu_iotlb_sync(),
>>>> when ATS is not used. According to the annotations, the load from
>>>> the
>>>> atomic_read(), that checks whether the domain uses ATS, is 77% of
>>>> the
>>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm
>>>> not sure
>>>> there is much room for optimization there.
>>>
>>> Well I did originally suggest using RCU protection to scan the list
>>> of
>>> devices, instead of reading an atomic and checking for non-zero
>>> value. But
>>> that would be an optimisation for ATS also, and there were no ATS
>>> devices at
>>> the time (to verify performance).
>>
>> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
>> performance, but I agree that using RCU to scan the list should get
>> better
>> results when using ATS.
>>
>> When ATS isn't in use however, I suspect reading nr_ats_masters should
>> be
>> more efficient than taking the RCU lock + reading an "ats_devices"
>> list
>> (since the smmu_domain->devices list also serves context descriptor
>> invalidation, even when ATS isn't in use). I'll run some tests
>> however, to
>> see if I can micro-optimize this case, but I don't expect noticeable
>> improvements.
>
> ok, cheers. I, too, would not expect a significant improvement there.
>
> JFYI, I've been playing with "perf annotate" today and it's giving
> strange results for my NVMe testing. So "report" looks somewhat sane,
> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>
>
> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
> arm_smmu_cmdq_issue_cmdlist
> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
> _raw_spin_unlock_irqrestore
> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
> arm_smmu_atc_inv_domain.constprop.42
> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>
> But "annotate" consistently tells me that a specific instruction
> consumes ~99% of the load for the enqueue function:
>
> : /* 5. If we are inserting a CMD_SYNC,
> we must wait for it to complete */
> : if (sync) {
> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
> : int ret = 0;
> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
> : if (sync) {
> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
> <arm_smmu_cmdq_issue_cmdlist+0x420>
> : arch_local_irq_restore():
> 0.00 : ffff80001071c954: msr daif, x21
> : arm_smmu_cmdq_issue_cmdlist():
> : }
> : }
> :
> : local_irq_restore(flags);
> : return ret;
> : }
> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
> <page_wait_table+0x14c0>
This is likely the side effect of the re-enabling of interrupts (msr daif, x21)
on the previous instruction, which causes the perf interrupt to fire
right after.
Time to enable pseudo-NMIs in the PMUv3 driver...
M.
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 16:33 ` Marc Zyngier
@ 2020-03-23 9:03 ` John Garry
2020-03-23 9:16 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-23 9:03 UTC (permalink / raw)
To: Marc Zyngier
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
On 20/03/2020 16:33, Marc Zyngier wrote:
>> JFYI, I've been playing with "perf annotate" today and it's giving
>> strange results for my NVMe testing. So "report" looks somewhat sane,
>> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>>
>>
>> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> arm_smmu_cmdq_issue_cmdlist
>> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> _raw_spin_unlock_irqrestore
>> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
>> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
>> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> arm_smmu_atc_inv_domain.constprop.42
>> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
>> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>>
>> But "annotate" consistently tells me that a specific instruction
>> consumes ~99% of the load for the enqueue function:
>>
>> : /* 5. If we are inserting a CMD_SYNC,
>> we must wait for it to complete */
>> : if (sync) {
>> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
>> : int ret = 0;
>> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
>> : if (sync) {
>> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
>> <arm_smmu_cmdq_issue_cmdlist+0x420>
>> : arch_local_irq_restore():
>> 0.00 : ffff80001071c954: msr daif, x21
>> : arm_smmu_cmdq_issue_cmdlist():
>> : }
>> : }
>> :
>> : local_irq_restore(flags);
>> : return ret;
>> : }
>> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
>> <page_wait_table+0x14c0>
>
Hi Marc,
> This is likely the side effect of the re-enabling of interrupts (msr
> daif, x21)
> on the previous instruction which causes the perf interrupt to fire
> right after.
ok, makes sense.
>
> Time to enable pseudo-NMIs in the PMUv3 driver...
>
Do you know if there is any plan for this?
In the meantime, maybe I can do some trickery by putting the
local_irq_restore() in a separate function, outside
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-23 9:03 ` John Garry
@ 2020-03-23 9:16 ` Marc Zyngier
2020-03-24 9:18 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-23 9:16 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
On 2020-03-23 09:03, John Garry wrote:
> On 20/03/2020 16:33, Marc Zyngier wrote:
>>> JFYI, I've been playing with "perf annotate" today and it's giving
>>> strange results for my NVMe testing. So "report" looks somewhat sane,
>>> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>>>
>>>
>>> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> arm_smmu_cmdq_issue_cmdlist
>>> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> _raw_spin_unlock_irqrestore
>>> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
>>> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
>>> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> arm_smmu_atc_inv_domain.constprop.42
>>> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
>>> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>>>
>>> But "annotate" consistently tells me that a specific instruction
>>> consumes ~99% of the load for the enqueue function:
>>>
>>> : /* 5. If we are inserting a CMD_SYNC,
>>> we must wait for it to complete */
>>> : if (sync) {
>>> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
>>> : int ret = 0;
>>> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
>>> : if (sync) {
>>> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
>>> <arm_smmu_cmdq_issue_cmdlist+0x420>
>>> : arch_local_irq_restore():
>>> 0.00 : ffff80001071c954: msr daif, x21
>>> : arm_smmu_cmdq_issue_cmdlist():
>>> : }
>>> : }
>>> :
>>> : local_irq_restore(flags);
>>> : return ret;
>>> : }
>>> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
>>> <page_wait_table+0x14c0>
>>
>
> Hi Marc,
>
>> This is likely the side effect of the re-enabling of interrupts (msr
>> daif, x21)
>> on the previous instruction which causes the perf interrupt to fire
>> right after.
>
> ok, makes sense.
>
>>
>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>
>
> Do you know if there is any plan for this?
There was. Julien Thierry has a bunch of patches for that [1], but they
need reviving.
>
> In the meantime, maybe I can do some trickery by putting the
> local_irq_restore() in a separate function, outside
> arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
> function.
I don't see how you can improve the profiling without compromising
the locking in this case...
Thanks,
M.
[1] https://patchwork.kernel.org/cover/11047407/
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-23 9:16 ` Marc Zyngier
@ 2020-03-24 9:18 ` John Garry
2020-03-24 10:43 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 9:18 UTC (permalink / raw)
To: Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On 23/03/2020 09:16, Marc Zyngier wrote:
+ Julien, Mark
Hi Marc,
>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>
>>
>> Do you know if there is any plan for this?
>
> There was. Julien Thierry has a bunch of patches for that [1], but they
> need reviving.
>
So those patches still apply cleanly (apart from the kvm patch, which I
can skip, I suppose) and build, so I can try this I figure. Is there
anything else which I should ensure or know about? Apart from enabling
CONFIG_ARM64_PSEUDO_NMI.
A quickly taken perf annotate and report is at the tip here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling
>>
>> In the meantime, maybe I can do some trickery by putting the
>> local_irq_restore() in a separate function, outside
>> arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
>> function.
>
Scratch that :)
> I don't see how you can improve the profiling without compromising
> the locking in this case...
>
Cheers,
John
[1] https://patchwork.kernel.org/cover/11047407/
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 9:18 ` John Garry
@ 2020-03-24 10:43 ` Marc Zyngier
2020-03-24 11:55 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-24 10:43 UTC (permalink / raw)
To: John Garry
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On Tue, 24 Mar 2020 09:18:10 +0000
John Garry <john.garry@huawei.com> wrote:
> On 23/03/2020 09:16, Marc Zyngier wrote:
>
> + Julien, Mark
>
> Hi Marc,
>
> >>> Time to enable pseudo-NMIs in the PMUv3 driver...
> >>>
> >>
> >> Do you know if there is any plan for this?
> >
> > There was. Julien Thierry has a bunch of patches for that [1], but they
> > need reviving.
> >
>
> So those patches still apply cleanly (apart from the kvm patch, which
> I can skip, I suppose) and build, so I can try this I figure. Is
> there anything else which I should ensure or know about? Apart from
> enabling CONFIG_ARM64_PSEUDO_NMI.
You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge... ;-).
M.
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 10:43 ` Marc Zyngier
@ 2020-03-24 11:55 ` John Garry
2020-03-24 12:07 ` Robin Murphy
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 11:55 UTC (permalink / raw)
To: Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On 24/03/2020 10:43, Marc Zyngier wrote:
> On Tue, 24 Mar 2020 09:18:10 +0000
> John Garry<john.garry@huawei.com> wrote:
>
>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>
>> + Julien, Mark
>>
>> Hi Marc,
>>
>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>
>>>> Do you know if there is any plan for this?
>>> There was. Julien Thierry has a bunch of patches for that [1], but they
>>> need reviving.
>>>
>> So those patches still apply cleanly (apart from the kvm patch, which
>> I can skip, I suppose) and build, so I can try this I figure. Is
>> there anything else which I should ensure or know about? Apart from
>> enabling CONFIG_ARM64_PSEUDO_NMI.
> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
> has it set to 0, preventing me from being able to use the feature
> (hint, nudge...;-).
Yeah, apparently it's set on our D06CS board, but I just need to double
check the FW version with our FW guy.
As for D05, there has not been a FW update there in quite a long time
and no plans for it. Sorry.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 11:55 ` John Garry
@ 2020-03-24 12:07 ` Robin Murphy
2020-03-24 12:37 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Robin Murphy @ 2020-03-24 12:07 UTC (permalink / raw)
To: John Garry, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
On 2020-03-24 11:55 am, John Garry wrote:
> On 24/03/2020 10:43, Marc Zyngier wrote:
>> On Tue, 24 Mar 2020 09:18:10 +0000
>> John Garry<john.garry@huawei.com> wrote:
>>
>>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>>
>>> + Julien, Mark
>>>
>>> Hi Marc,
>>>
>>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>>
>>>>> Do you know if there is any plan for this?
>>>> There was. Julien Thierry has a bunch of patches for that [1], but
>>>> they need reviving.
>>>>
>>> So those patches still apply cleanly (apart from the kvm patch, which
>>> I can skip, I suppose) and build, so I can try this I figure. Is
>>> there anything else which I should ensure or know about? Apart from
>>> enabling CONFIG_ARM64_PSEUDO_NMI.
>> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
>> has it set to 0, preventing me from being able to use the feature
>> (hint, nudge...;-).
>
> Yeah, apparently it's set on our D06CS board, but I just need to double
> check the FW version with our FW guy.
Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but
since it's not been called out:
This high priority configuration for interrupts needs to be
explicitly enabled by setting the kernel parameter
"irqchip.gicv3_pseudo_nmi" to 1.
FWIW I believe it is still on the plan for someone here to dust off the
PMU pNMI patches at some point.
Robin.
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 12:07 ` Robin Murphy
@ 2020-03-24 12:37 ` John Garry
2020-03-25 15:31 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 12:37 UTC (permalink / raw)
To: Robin Murphy, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
On 24/03/2020 12:07, Robin Murphy wrote:
> On 2020-03-24 11:55 am, John Garry wrote:
>> On 24/03/2020 10:43, Marc Zyngier wrote:
>>> On Tue, 24 Mar 2020 09:18:10 +0000
>>> John Garry<john.garry@huawei.com> wrote:
>>>
>>>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>>>
>>>> + Julien, Mark
>>>>
>>>> Hi Marc,
>>>>
>>>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>>>
>>>>>> Do you know if there is any plan for this?
>>>>> There was. Julien Thierry has a bunch of patches for that [1], but
>>>>> they need reviving.
>>>>>
>>>> So those patches still apply cleanly (apart from the kvm patch, which
>>>> I can skip, I suppose) and build, so I can try this I figure. Is
>>>> there anything else which I should ensure or know about? Apart from
>>>> enabling CONFIG_ARM64_PSEUDO_NMI.
>>> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
>>> has it set to 0, preventing me from being able to use the feature
>>> (hint, nudge...;-).
>>
>> Yeah, apparently it's set on our D06CS board, but I just need to
>> double check the FW version with our FW guy.
>
> Hopefully you saw the help for CONFIG_ARM64_PSUEDO_NMI already, but
> since it's not been called out:
>
> This high priority configuration for interrupts needs to be
> explicitly enabled by setting the kernel parameter
> "irqchip.gicv3_pseudo_nmi" to 1.
Yeah, I saw that by chance somewhere else previously.
>
> FWIW I believe it is still on the plan for someone here to dust off the
> PMU pNMI patches at some point.
Cool. Well I can try to experiment with what Julien had at v4 for now.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 12:37 ` John Garry
@ 2020-03-25 15:31 ` John Garry
0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2020-03-25 15:31 UTC (permalink / raw)
To: Robin Murphy, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
>
>>
>> FWIW I believe it is still on the plan for someone here to dust off
>> the PMU pNMI patches at some point.
>
> Cool. Well I can try to experiment with what Julien had at v4 for now.
>
JFYI, I have done some more perf record capturing, and updated the
"annotate" and "report" output here
https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt
and .../report
This capture is just for cpu0, since NVMe irq handling+dma unmapping
will occur on specific CPUs, cpu0 being one of them.
The reports look somewhat sane. So we no longer have ~99% of time
attributed to re-enabling interrupts, now that's like:
 3.14 : ffff80001071eae0: ldr w0, [x29, #108]
      : int ret = 0;
 0.00 : ffff80001071eae4: mov w24, #0x0 // #0
      : if (sync) {
 0.00 : ffff80001071eae8: cbnz w0, ffff80001071eb44 <arm_smmu_cmdq_issue_cmdlist+0x44c>
      : arch_local_irq_restore():
      : asm volatile(ALTERNATIVE(
 0.00 : ffff80001071eaec: msr daif, x21
      : arch_static_branch():
 0.25 : ffff80001071eaf0: nop
      : arm_smmu_cmdq_issue_cmdlist():
      : }
      : }
      :
      : local_irq_restore(flags);
      : return ret;
      : }
One observation (if these reports are to be believed) is that we may
spend a lot of time in the CAS loop, trying to get a place in the queue
initially:
: __CMPXCHG_CASE(w, , , 32, )
: __CMPXCHG_CASE(x, , , 64, )
0.00 : ffff80001071e828: mov x0, x27
0.00 : ffff80001071e82c: mov x4, x1
0.00 : ffff80001071e830: cas x4, x2, [x27]
28.61 : ffff80001071e834: mov x0, x4
: arm_smmu_cmdq_issue_cmdlist():
: if (old == llq.val)
0.00 : ffff80001071e838: ldr x1, [x23]
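For reference, the CAS loop above amounts to retrying a compare-and-swap on a packed prod/cons word until the CPU claims its slots. The following userspace model (the field layout and names are my assumptions, not the driver's) shows why every failed CAS costs a fresh read of the contended cacheline:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of claiming 'n' slots on a shared queue by CAS on a
 * packed prod/cons word, as in arm_smmu_cmdq_issue_cmdlist(). Assumed
 * layout: low 32 bits prod, high 32 bits cons. */
static uint64_t claim_slots(_Atomic uint64_t *llq, uint32_t n)
{
	uint64_t old = atomic_load(llq);

	for (;;) {
		uint32_t prod = (uint32_t)old;
		uint32_t cons = (uint32_t)(old >> 32);
		uint64_t new = ((uint64_t)cons << 32) | (prod + n);

		/* on failure, 'old' is reloaded from the contended line
		 * and the whole attempt is retried */
		if (atomic_compare_exchange_weak(llq, &old, new))
			return old;	/* we own [prod, prod + n) */
	}
}
```

With many CPUs racing, each failure means another round trip for the cacheline, which matches the exponential failure rate observed as more CPUs get involved.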
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 10:41 ` John Garry
2020-03-20 11:18 ` Jean-Philippe Brucker
@ 2020-05-22 14:52 ` John Garry
2020-05-25 5:57 ` Song Bao Hua (Barry Song)
1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2020-05-22 14:52 UTC (permalink / raw)
To: Will Deacon, Robin Murphy
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, alexandru.elisei
On 20/03/2020 10:41, John Garry wrote:
+ Barry, Alexandru
>>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost:
>>>>> 0/34434 drop:
>>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>>>>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>>>>> 2.65% [kernel] [k] get_user_pages_fast
>>>>> 2.03% [kernel] [k] __slab_free
>>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
>>>>> 1.47% [kernel] [k] arm_lpae_map
>>>>> 1.39% [kernel] [k] __fget
>>>>> 1.14% [kernel] [k] __lock_text_start
>>>>> 1.09% [kernel] [k] _raw_spin_lock
>>>>> 1.08% [kernel] [k] bio_release_pages.part.42
>>>>> 1.03% [kernel] [k] __sbitmap_get_word
>>>>> 0.97% [kernel] [k]
>>>>> arm_smmu_atc_inv_domain.constprop.42
>>>>> 0.91% [kernel] [k] fput_many
>>>>> 0.88% [kernel] [k] __arm_lpae_map
>>>>>
Hi Will, Robin,
I'm just getting around to look at this topic again. Here's the current
picture for my NVMe test:
perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear
* only certain CPUs run the dma unmap for my scenario, cpu0 being one of
them.
Colleague Barry has similar findings for some other scenarios.
So we tried the latest perf NMI support wip patches, and noticed a few
hotspots (see
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
and
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
when running some NVMe traffic:
- initial cmpxchg to get a place in the queue
- when more CPUs get involved, we start failing at an exponential rate
 0.00 : ffff8000107a3500: cas x4, x2, [x27]
26.52 : ffff8000107a3504: mov x0, x4
      : arm_smmu_cmdq_issue_cmdlist():
- the queue locking
- polling cmd_sync
Some ideas to optimise:
a. initial cmpxchg
So this cmpxchg could be considered unfair. In addition, with all the
contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
around the system.
Maybe we can implement something similar to the idea of queued/ticketed
spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
initial cmpxchg fails, released by its leader, and in turn releasing
subsequent followers.
b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop checking for queue
full or similar. This should also help with problem a., as the less time
between cmpxchg attempts, the less chance of failing (since we check for
available queue space between attempts).
So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
always issue a cmd_sync for a batch (regardless of whether requested),
then we should never fill (I think).
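That inequality can be written as a trivial helper (a hypothetical sketch; the per-CPU batch-plus-CMD_SYNC accounting is my assumption, not something the driver exposes):

```c
#include <stdbool.h>

/* Hypothetical helper for idea b.: if every CPU can have at most one
 * batch of commands plus its trailing CMD_SYNC in flight, then a queue
 * deeper than nr_cpus * (max_batch + 1) entries can never become full,
 * so the queue-full polling could be elided. */
static bool cmdq_can_never_fill(unsigned int cmdq_depth,
				unsigned int nr_cpus,
				unsigned int max_batch)
{
	return cmdq_depth > nr_cpus * (max_batch + 1);
}
```

For example, a 32768-entry queue with 96 CPUs and batches of up to 31 commands satisfies the condition with a lot of headroom.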
c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't
think that this is required.
d. More minor ideas are to move forward when the "owner" stops gathering
to reduce the time spent advancing the prod, hopefully reducing cmd_sync
polling time; and also to use a smaller word size for the valid bitmap
operations - maybe 32b atomic operations are overall more efficient than
64b, as the valid range check is mostly < 16 bits from my observation.
Let me know your thoughts or any other ideas.
Thanks,
John
* RE: arm-smmu-v3 high cpu usage for NVMe
2020-05-22 14:52 ` John Garry
@ 2020-05-25 5:57 ` Song Bao Hua (Barry Song)
0 siblings, 0 replies; 32+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-05-25 5:57 UTC (permalink / raw)
To: John Garry, Will Deacon, Robin Murphy
Cc: Jean-Philippe Brucker, Marc Zyngier, Linuxarm, Ming Lei, iommu,
alexandru.elisei
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
>
> + Barry, Alexandru
>
> >>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost:
> >>>>> 0/34434 drop:
> >>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
> >>>>>
> ----------------------------------------------------------------------------------------------------------------
> ----------
> >>>>>
> >>>>>
> >>>>> 27.43% [kernel] [k]
> arm_smmu_cmdq_issue_cmdlist
> >>>>> 11.71% [kernel] [k]
> _raw_spin_unlock_irqrestore
> >>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
> >>>>> 2.65% [kernel] [k] get_user_pages_fast
> >>>>> 2.03% [kernel] [k] __slab_free
> >>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
> >>>>> 1.47% [kernel] [k] arm_lpae_map
> >>>>> 1.39% [kernel] [k] __fget
> >>>>> 1.14% [kernel] [k] __lock_text_start
> >>>>> 1.09% [kernel] [k] _raw_spin_lock
> >>>>> 1.08% [kernel] [k] bio_release_pages.part.42
> >>>>> 1.03% [kernel] [k] __sbitmap_get_word
> >>>>> 0.97% [kernel] [k]
> >>>>> arm_smmu_atc_inv_domain.constprop.42
> >>>>> 0.91% [kernel] [k] fput_many
> >>>>> 0.88% [kernel] [k] __arm_lpae_map
> >>>>>
>
> Hi Will, Robin,
>
> I'm just getting around to look at this topic again. Here's the current
> picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
>
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
>
> Colleague Barry has similar findings for some other scenarios.
I wrote a test module which uses the parameter "ways" to simulate how busy
the SMMU is, and compared the latency under different degrees of contention.
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/jiffies.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
With ways=1, the SMMU is quite free with only one user, yet
arm_smmu_cmdq_issue_cmdlist() spends more than 60% of its time in
arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports completion of
CMD_SYNC quite slowly.
When I increased "ways", I found the contention increased rapidly. With
ways=16, more than 40% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
and with ways=64, more than 60% of the time is spent on that same cmpxchg.
Here is a table for dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC
with different ways:

          whole unmap (ns)   arm_smmu_cmdq_issue_cmdlist() (ns)   wait CMD_SYNC (ns)
Ways=1          1956                     1328                            883
Ways=16         8891                     7474                           4000
Ways=32        22043                    19519                           6879
Ways=64        60842                    55895                          16746
Ways=96       101880                    93649                          24429
As you can see, even with ways=1 we still need ~2us to unmap;
arm_smmu_cmdq_issue_cmdlist() takes about 68% of the dma_unmap time, and
CMD_SYNC takes more than 60% of arm_smmu_cmdq_issue_cmdlist().
When the SMMU is very busy, dma_unmap latency can become very large due to
contention - more than 100us.
Thanks
Barry
>
> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3
> 912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3
> 912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
>
> - initial cmpxchg to get a place in the queue
> - when more CPUs get involved, we start failing at an exponential rate
> 0.00 : ffff8000107a3500: cas x4, x2, [x27]
> 26.52 : ffff8000107a3504: mov x0, x4 :
> arm_smmu_cmdq_issue_cmdlist():
>
> - the queue locking
> - polling cmd_sync
>
> Some ideas to optimise:
>
> a. initial cmpxchg
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
> initial cmpxchg fails, to be released by its leader, which in turn
> releases subsequent followers
>
> b. Drop the queue_full checking in certain circumstances
> If we cannot theoretically fill the queue, then stop the checking for
> queue full or similar. This should also help the problem in a., as the
> less time between cmpxchg attempts, the less chance of failing (as we
> check queue available space between cmpxchg attempts).
>
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
> always issue a cmd_sync for a batch (regardless of whether requested),
> then we should never fill (I think).
>
> c. Don't do queue locking in certain circumstances
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
>
> d. More minor ideas are to move forward when the "owner" stops gathering,
> to reduce the time spent advancing the prod pointer and hopefully the
> cmd_sync polling time; and to use a smaller word size for the valid
> bitmap operations, as 32-bit atomic operations may be more efficient
> overall than 64-bit ones (the valid range is mostly < 16 bits from my
> observation).
>
> Let me know your thoughts or any other ideas.
>
> Thanks,
> John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>]
* Re: arm-smmu-v3 high cpu usage for NVMe
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
@ 2020-04-06 15:11 ` John Garry
0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2020-04-06 15:11 UTC (permalink / raw)
To: Will Deacon
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
On 02/04/2020 13:10, John Garry wrote:
> On 18/03/2020 20:53, Will Deacon wrote:
>>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the
>>> testing our
>>> batch size is 1, so we're not seeing the real benefit of the batching. I
>>> can't help but think that we could improve this code to try to
>>> combine CMD
>>> SYNCs for small batches.
>>>
>>> Anyway, let me know your thoughts or any questions. I'll have a look
>>> if a
>>> get a chance for other possible bottlenecks.
>> Did you ever get any more information on this? I don't have any SMMUv3
>> hardware any more, so I can't really dig into this myself.
>>
>
Hi Will,
JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some
idea of what is going on. Perf annotate did not tell much.
I tested NVMe performance with and without Marc's patchset to spread
LPIs for managed interrupts.
Average duration of arm_smmu_cmdq_issue_cmdlist() mainline [all results
are approximations]:
owner: 6ms
non-owner: 4ms
mainline + LPI spreading patchset:
owner: 25ms
non-owner: 22ms
For this, a list would be an itlb + cmd_sync.
Please note that the LPI spreading patchset is still giving circa 25%
NVMe throughput increase. What happens there is that many more CPUs get
involved, which creates more inter-CPU contention, but the performance
increase comes from alleviating pressure on those overloaded CPUs.
I also notice that with the LPI spreading patchset, on average a CPU is
an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 time in 8, as opposed to 1
in 3 for mainline. This means that we're just creating longer chains of
lists to be published.
But I found that for a non-owner, the average MSI cmd_sync polling time
is 12ms with the LPI spreading patchset. As such, it seems to really be
taking approx (12*2/8-1=) ~3ms to consume a single list. This seems
consistent with my finding that an owner also polls consumption for 3ms.
Without the LPI spreading patchset, polling time is approx 2 and 3ms for
owner and non-owner, respectively.
As an experiment, I tried hacking the code to use a spinlock again for
protecting the command queue, instead of the current solution, and
always saw a performance drop there, as expected. But maybe we can try
not using a spinlock while still serialising production+consumption to
alleviate the long polling periods.
Let me know your thoughts.
Cheers,
John