* [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
` (7 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
It turns out that we've always relied on some subtle ordering guarantees
when inserting commands into the SMMUv3 command queue. With the recent
changes to elide locking when possible, these guarantees become more
subtle and even more important.
Add a comment documenting the barrier semantics of command insertion so
that we don't have to derive the behaviour from scratch each time it
comes up on the list.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b36a99971401..3402b1bc8e94 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1286,6 +1286,22 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
}
}
+/*
+ * This is the actual insertion function, and provides the following
+ * ordering guarantees to callers:
+ *
+ * - There is a dma_wmb() before publishing any commands to the queue.
+ * This can be relied upon to order prior writes to data structures
+ * in memory (such as a CD or an STE) before the command.
+ *
+ * - On completion of a CMD_SYNC, there is a control dependency.
+ * This can be relied upon to order subsequent writes to memory (e.g.
+ * freeing an IOVA) after completion of the CMD_SYNC.
+ *
+ * - Command insertion is totally ordered, so if two CPUs each race to
+ * insert their own list of commands then all of the commands from one
+ * CPU will appear before any of the commands from the other CPU.
+ */
static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u64 *cmds, int n, bool sync)
{
--
2.11.0
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
2019-08-21 15:17 ` [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:36 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag Will Deacon
` (6 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Robin Murphy, stable, Will Deacon
Detecting the ATS capability of the SMMU at probe time introduces a
spinlock into the ->unmap() fast path, even when ATS is not actually
in use. Furthermore, the ATC invalidation that exists is broken, as it
occurs before invalidation of the main SMMU TLB which leaves a window
where the ATC can be repopulated with stale entries.
Given that ATS is both a new feature and a specialist sport, disable it
for now whilst we fix it properly in subsequent patches. Since PRI
requires ATS, disable that too.
Cc: <stable@vger.kernel.org>
Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 3402b1bc8e94..7a368059cd7d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3295,11 +3295,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Boolean feature flags */
+#if 0 /* ATS invalidation is slow and broken */
if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
+#endif
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
--
2.11.0
* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
@ 2019-08-21 15:36 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 15:36 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker, stable
On 21/08/2019 16:17, Will Deacon wrote:
> Detecting the ATS capability of the SMMU at probe time introduces a
> spinlock into the ->unmap() fast path, even when ATS is not actually
> in use. Furthermore, the ATC invalidation that exists is broken, as it
> occurs before invalidation of the main SMMU TLB which leaves a window
> where the ATC can be repopulated with stale entries.
>
> Given that ATS is both a new feature and a specialist sport, disable it
> for now whilst we fix it properly in subsequent patches. Since PRI
> requires ATS, disable that too.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
Acked-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 3402b1bc8e94..7a368059cd7d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3295,11 +3295,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
> }
>
> /* Boolean feature flags */
> +#if 0 /* ATS invalidation is slow and broken */
> if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
> smmu->features |= ARM_SMMU_FEAT_PRI;
>
> if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
> smmu->features |= ARM_SMMU_FEAT_ATS;
> +#endif
>
> if (reg & IDR0_SEV)
> smmu->features |= ARM_SMMU_FEAT_SEV;
>
* [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
2019-08-21 15:17 ` [PATCH v2 1/8] iommu/arm-smmu-v3: Document ordering guarantees of command insertion Will Deacon
2019-08-21 15:17 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Disable detection of ATS and PRI Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations Will Deacon
` (5 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
There's really no need for this to be a bitfield, particularly as we
don't have bitwise addressing on arm64.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 7a368059cd7d..2be11a11bb8b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -637,7 +637,7 @@ struct arm_smmu_master {
struct list_head domain_head;
u32 *sids;
unsigned int num_sids;
- bool ats_enabled :1;
+ bool ats_enabled;
};
/* SMMU private data for an IOMMU domain */
--
2.11.0
* [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (2 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 3/8] iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
` (4 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
Calling arm_smmu_tlb_inv_range() with a size of zero, perhaps due to
an empty 'iommu_iotlb_gather' structure, should be a NOP. Elide the
CMD_SYNC when there is no invalidation to be performed.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 2be11a11bb8b..b7b3b0ff8ed6 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1977,6 +1977,9 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
},
};
+ if (!size)
+ return;
+
if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
cmd.opcode = CMDQ_OP_TLBI_NH_VA;
cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid;
--
2.11.0
* [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (3 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 4/8] iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 15:50 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
` (3 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
To prevent any potential issues arising from speculative Address
Translation Requests from an ATS-enabled PCIe endpoint, rework our ATS
enabling/disabling logic so that we enable ATS at the SMMU before we
enable it at the endpoint, and disable things in the opposite order.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b7b3b0ff8ed6..d7c65dfe42dc 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2286,44 +2286,52 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
}
}
-static int arm_smmu_enable_ats(struct arm_smmu_master *master)
+static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
{
- int ret;
- size_t stu;
struct pci_dev *pdev;
struct arm_smmu_device *smmu = master->smmu;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
if (!(smmu->features & ARM_SMMU_FEAT_ATS) || !dev_is_pci(master->dev) ||
!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS) || pci_ats_disabled())
- return -ENXIO;
+ return false;
pdev = to_pci_dev(master->dev);
- if (pdev->untrusted)
- return -EPERM;
+ return !pdev->untrusted && pdev->ats_cap;
+}
- /* Smallest Translation Unit: log2 of the smallest supported granule */
- stu = __ffs(smmu->pgsize_bitmap);
+static void arm_smmu_enable_ats(struct arm_smmu_master *master)
+{
+ size_t stu;
+ struct pci_dev *pdev;
+ struct arm_smmu_device *smmu = master->smmu;
- ret = pci_enable_ats(pdev, stu);
- if (ret)
- return ret;
+ /* Don't enable ATS at the endpoint if it's not enabled in the STE */
+ if (!master->ats_enabled)
+ return;
- master->ats_enabled = true;
- return 0;
+ /* Smallest Translation Unit: log2 of the smallest supported granule */
+ stu = __ffs(smmu->pgsize_bitmap);
+ pdev = to_pci_dev(master->dev);
+ if (pci_enable_ats(pdev, stu))
+ dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
}
static void arm_smmu_disable_ats(struct arm_smmu_master *master)
{
struct arm_smmu_cmdq_ent cmd;
- if (!master->ats_enabled || !dev_is_pci(master->dev))
+ if (!master->ats_enabled)
return;
+ pci_disable_ats(to_pci_dev(master->dev));
+ /*
+ * Ensure ATS is disabled at the endpoint before we issue the
+ * ATC invalidation via the SMMU.
+ */
+ wmb();
arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
arm_smmu_atc_inv_master(master, &cmd);
- pci_disable_ats(to_pci_dev(master->dev));
- master->ats_enabled = false;
}
static void arm_smmu_detach_dev(struct arm_smmu_master *master)
@@ -2338,10 +2346,10 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
list_del(&master->domain_head);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+ arm_smmu_disable_ats(master);
master->domain = NULL;
+ master->ats_enabled = false;
arm_smmu_install_ste_for_dev(master);
-
- arm_smmu_disable_ats(master);
}
static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
@@ -2386,12 +2394,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
- arm_smmu_enable_ats(master);
+ master->ats_enabled = arm_smmu_ats_supported(master);
if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
arm_smmu_install_ste_for_dev(master);
+ arm_smmu_enable_ats(master);
out_unlock:
mutex_unlock(&smmu_domain->init_mutex);
return ret;
--
2.11.0
* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
@ 2019-08-21 15:50 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 15:50 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> To prevent any potential issues arising from speculative Address
> Translation Requests from an ATS-enabled PCIe endpoint, rework our ATS
> enabling/disabling logic so that we enable ATS at the SMMU before we
> enable it at the endpoint, and disable things in the opposite order.
This time it looks right :)
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++------------------
> 1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index b7b3b0ff8ed6..d7c65dfe42dc 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2286,44 +2286,52 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
> }
> }
>
> -static int arm_smmu_enable_ats(struct arm_smmu_master *master)
> +static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
> {
> - int ret;
> - size_t stu;
> struct pci_dev *pdev;
> struct arm_smmu_device *smmu = master->smmu;
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
>
> if (!(smmu->features & ARM_SMMU_FEAT_ATS) || !dev_is_pci(master->dev) ||
> !(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS) || pci_ats_disabled())
> - return -ENXIO;
> + return false;
>
> pdev = to_pci_dev(master->dev);
> - if (pdev->untrusted)
> - return -EPERM;
> + return !pdev->untrusted && pdev->ats_cap;
> +}
>
> - /* Smallest Translation Unit: log2 of the smallest supported granule */
> - stu = __ffs(smmu->pgsize_bitmap);
> +static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> +{
> + size_t stu;
> + struct pci_dev *pdev;
> + struct arm_smmu_device *smmu = master->smmu;
>
> - ret = pci_enable_ats(pdev, stu);
> - if (ret)
> - return ret;
> + /* Don't enable ATS at the endpoint if it's not enabled in the STE */
> + if (!master->ats_enabled)
> + return;
>
> - master->ats_enabled = true;
> - return 0;
> + /* Smallest Translation Unit: log2 of the smallest supported granule */
> + stu = __ffs(smmu->pgsize_bitmap);
> + pdev = to_pci_dev(master->dev);
> + if (pci_enable_ats(pdev, stu))
> + dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
> }
>
> static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> {
> struct arm_smmu_cmdq_ent cmd;
>
> - if (!master->ats_enabled || !dev_is_pci(master->dev))
> + if (!master->ats_enabled)
> return;
>
> + pci_disable_ats(to_pci_dev(master->dev));
> + /*
> + * Ensure ATS is disabled at the endpoint before we issue the
> + * ATC invalidation via the SMMU.
> + */
> + wmb();
> arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
> arm_smmu_atc_inv_master(master, &cmd);
> - pci_disable_ats(to_pci_dev(master->dev));
> - master->ats_enabled = false;
> }
>
> static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> @@ -2338,10 +2346,10 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> list_del(&master->domain_head);
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> + arm_smmu_disable_ats(master);
> master->domain = NULL;
> + master->ats_enabled = false;
> arm_smmu_install_ste_for_dev(master);
> -
> - arm_smmu_disable_ats(master);
> }
>
> static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> @@ -2386,12 +2394,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> - arm_smmu_enable_ats(master);
> + master->ats_enabled = arm_smmu_ats_supported(master);
>
> if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
> arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
>
> arm_smmu_install_ste_for_dev(master);
> + arm_smmu_enable_ats(master);
> out_unlock:
> mutex_unlock(&smmu_domain->init_mutex);
> return ret;
>
* [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (4 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-21 16:25 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
` (2 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
When invalidating the ATC for a PCIe endpoint using ATS, we must take
care to complete invalidation of the main SMMU TLBs beforehand, otherwise
the device could immediately repopulate its ATC with stale translations.
Hooking the ATC invalidation into ->unmap() as we currently do does the
exact opposite: it ensures that the ATC is invalidated *before* the
main TLBs, which is bogus.
Move ATC invalidation into the actual (leaf) invalidation routines so
that it is always called after completing main TLB invalidation.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d7c65dfe42dc..ca504a60312d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1961,6 +1961,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
*/
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
}
static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
@@ -1969,7 +1970,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
{
u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
struct arm_smmu_device *smmu = smmu_domain->smmu;
- unsigned long end = iova + size;
+ unsigned long start = iova, end = iova + size;
int i = 0;
struct arm_smmu_cmdq_ent cmd = {
.tlbi = {
@@ -2001,6 +2002,12 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
}
arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
+
+ /*
+ * Unfortunately, this can't be leaf-only since we may have
+ * zapped an entire table.
+ */
+ arm_smmu_atc_inv_domain(smmu_domain, 0, start, size);
}
static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
@@ -2420,18 +2427,13 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
size_t size, struct iommu_iotlb_gather *gather)
{
- int ret;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
if (!ops)
return 0;
- ret = ops->unmap(ops, iova, size, gather);
- if (ret && arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size))
- return 0;
-
- return ret;
+ return ops->unmap(ops, iova, size, gather);
}
static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
--
2.11.0
* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
@ 2019-08-21 16:25 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-21 16:25 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> When invalidating the ATC for a PCIe endpoint using ATS, we must take
> care to complete invalidation of the main SMMU TLBs beforehand, otherwise
> the device could immediately repopulate its ATC with stale translations.
>
> Hooking the ATC invalidation into ->unmap() as we currently do does the
> exact opposite: it ensures that the ATC is invalidated *before* the
> main TLBs, which is bogus.
>
> Move ATC invalidation into the actual (leaf) invalidation routines so
> that it is always called after completing main TLB invalidation.
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index d7c65dfe42dc..ca504a60312d 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -1961,6 +1961,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
> */
> arm_smmu_cmdq_issue_cmd(smmu, &cmd);
> arm_smmu_cmdq_issue_sync(smmu);
> + arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> }
>
> static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> @@ -1969,7 +1970,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> {
> u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
> struct arm_smmu_device *smmu = smmu_domain->smmu;
> - unsigned long end = iova + size;
> + unsigned long start = iova, end = iova + size;
> int i = 0;
> struct arm_smmu_cmdq_ent cmd = {
> .tlbi = {
> @@ -2001,6 +2002,12 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> }
>
> arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
> +
> + /*
> + * Unfortunately, this can't be leaf-only since we may have
> + * zapped an entire table.
> + */
> + arm_smmu_atc_inv_domain(smmu_domain, 0, start, size);
> }
>
> static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
> @@ -2420,18 +2427,13 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
> static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
> size_t size, struct iommu_iotlb_gather *gather)
> {
> - int ret;
> struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
>
> if (!ops)
> return 0;
>
> - ret = ops->unmap(ops, iova, size, gather);
> - if (ret && arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size))
> - return 0;
> -
> - return ret;
> + return ops->unmap(ops, iova, size, gather);
> }
>
> static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
>
* [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (5 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2019-08-22 12:36 ` Robin Murphy
2019-08-21 15:17 ` [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI" Will Deacon
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
8 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
When ATS is not in use, we can avoid taking the 'devices_lock' for the
domain on the invalidation path by simply caching the number of ATS
masters currently attached. The fiddly part is handling a concurrent
->attach() of an ATS-enabled master to a domain that is being
invalidated, but we can handle this using an 'smp_mb()' to ensure that
our check of the count is ordered after completion of our prior TLB
invalidation.
This also makes our ->attach() and ->detach() flows symmetric wrt ATS
interactions.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ca504a60312d..0e43529d55fe 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -654,6 +654,7 @@ struct arm_smmu_domain {
struct io_pgtable_ops *pgtbl_ops;
bool non_strict;
+ atomic_t nr_ats_masters;
enum arm_smmu_domain_stage stage;
union {
@@ -1926,6 +1927,23 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
return 0;
+ /*
+ * Ensure that we've completed prior invalidation of the main TLBs
+ * before we read 'nr_ats_masters' in case of a concurrent call to
+ * arm_smmu_enable_ats():
+ *
+ * // unmap() // arm_smmu_enable_ats()
+ * TLBI+SYNC atomic_inc(&nr_ats_masters);
+ * smp_mb(); [...]
+ * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
+ *
+ * Ensures that we always see the incremented 'nr_ats_masters' count if
+ * ATS was enabled at the PCI device before completion of the TLBI.
+ */
+ smp_mb();
+ if (!atomic_read(&smmu_domain->nr_ats_masters))
+ return 0;
+
arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
@@ -2312,6 +2330,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
size_t stu;
struct pci_dev *pdev;
struct arm_smmu_device *smmu = master->smmu;
+ struct arm_smmu_domain *smmu_domain = master->domain;
/* Don't enable ATS at the endpoint if it's not enabled in the STE */
if (!master->ats_enabled)
@@ -2320,6 +2339,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
/* Smallest Translation Unit: log2 of the smallest supported granule */
stu = __ffs(smmu->pgsize_bitmap);
pdev = to_pci_dev(master->dev);
+
+ atomic_inc(&smmu_domain->nr_ats_masters);
+ arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
if (pci_enable_ats(pdev, stu))
dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
}
@@ -2327,6 +2349,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
static void arm_smmu_disable_ats(struct arm_smmu_master *master)
{
struct arm_smmu_cmdq_ent cmd;
+ struct arm_smmu_domain *smmu_domain = master->domain;
if (!master->ats_enabled)
return;
@@ -2339,6 +2362,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master)
wmb();
arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
arm_smmu_atc_inv_master(master, &cmd);
+ atomic_dec(&smmu_domain->nr_ats_masters);
}
static void arm_smmu_detach_dev(struct arm_smmu_master *master)
@@ -2349,11 +2373,12 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
if (!smmu_domain)
return;
+ arm_smmu_disable_ats(master);
+
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
list_del(&master->domain_head);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
- arm_smmu_disable_ats(master);
master->domain = NULL;
master->ats_enabled = false;
arm_smmu_install_ste_for_dev(master);
@@ -2396,10 +2421,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
master->domain = smmu_domain;
- spin_lock_irqsave(&smmu_domain->devices_lock, flags);
- list_add(&master->domain_head, &smmu_domain->devices);
- spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
-
if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
master->ats_enabled = arm_smmu_ats_supported(master);
@@ -2407,7 +2428,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
arm_smmu_install_ste_for_dev(master);
+
+ spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+ list_add(&master->domain_head, &smmu_domain->devices);
+ spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
arm_smmu_enable_ats(master);
+
out_unlock:
mutex_unlock(&smmu_domain->init_mutex);
return ret;
--
2.11.0
* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
@ 2019-08-22 12:36 ` Robin Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Robin Murphy @ 2019-08-22 12:36 UTC (permalink / raw)
To: Will Deacon, iommu; +Cc: Jean-Philippe Brucker
On 21/08/2019 16:17, Will Deacon wrote:
> When ATS is not in use, we can avoid taking the 'devices_lock' for the
> domain on the invalidation path by simply caching the number of ATS
> masters currently attached. The fiddly part is handling a concurrent
> ->attach() of an ATS-enabled master to a domain that is being
> invalidated, but we can handle this using an 'smp_mb()' to ensure that
> our check of the count is ordered after completion of our prior TLB
> invalidation.
>
> This also makes our ->attach() and ->detach() flows symmetric wrt ATS
> interactions.
I don't have the bandwidth just now for a deep dive into the ordering
subtleties, but on the surface I think this looks sound - provided that
we don't forget and start calling arm_smmu_atc_inv_master() directly
from anywhere other than detach.
Acked-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> drivers/iommu/arm-smmu-v3.c | 37 ++++++++++++++++++++++++++++++++-----
> 1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index ca504a60312d..0e43529d55fe 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -654,6 +654,7 @@ struct arm_smmu_domain {
>
> struct io_pgtable_ops *pgtbl_ops;
> bool non_strict;
> + atomic_t nr_ats_masters;
>
> enum arm_smmu_domain_stage stage;
> union {
> @@ -1926,6 +1927,23 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
> if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> return 0;
>
> + /*
> + * Ensure that we've completed prior invalidation of the main TLBs
> + * before we read 'nr_ats_masters' in case of a concurrent call to
> + * arm_smmu_enable_ats():
> + *
> + * // unmap() // arm_smmu_enable_ats()
> + * TLBI+SYNC atomic_inc(&nr_ats_masters);
> + * smp_mb(); [...]
> + * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
> + *
> + * Ensures that we always see the incremented 'nr_ats_masters' count if
> + * ATS was enabled at the PCI device before completion of the TLBI.
> + */
> + smp_mb();
> + if (!atomic_read(&smmu_domain->nr_ats_masters))
> + return 0;
> +
> arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
>
> spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> @@ -2312,6 +2330,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> size_t stu;
> struct pci_dev *pdev;
> struct arm_smmu_device *smmu = master->smmu;
> + struct arm_smmu_domain *smmu_domain = master->domain;
>
> /* Don't enable ATS at the endpoint if it's not enabled in the STE */
> if (!master->ats_enabled)
> @@ -2320,6 +2339,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> /* Smallest Translation Unit: log2 of the smallest supported granule */
> stu = __ffs(smmu->pgsize_bitmap);
> pdev = to_pci_dev(master->dev);
> +
> + atomic_inc(&smmu_domain->nr_ats_masters);
> + arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> if (pci_enable_ats(pdev, stu))
> dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
> }
> @@ -2327,6 +2349,7 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
> static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> {
> struct arm_smmu_cmdq_ent cmd;
> + struct arm_smmu_domain *smmu_domain = master->domain;
>
> if (!master->ats_enabled)
> return;
> @@ -2339,6 +2362,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master)
> wmb();
> arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
> arm_smmu_atc_inv_master(master, &cmd);
> + atomic_dec(&smmu_domain->nr_ats_masters);
> }
>
> static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> @@ -2349,11 +2373,12 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> if (!smmu_domain)
> return;
>
> + arm_smmu_disable_ats(master);
> +
> spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> list_del(&master->domain_head);
> spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> - arm_smmu_disable_ats(master);
> master->domain = NULL;
> master->ats_enabled = false;
> arm_smmu_install_ste_for_dev(master);
> @@ -2396,10 +2421,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>
> master->domain = smmu_domain;
>
> - spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> - list_add(&master->domain_head, &smmu_domain->devices);
> - spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> -
> if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> master->ats_enabled = arm_smmu_ats_supported(master);
>
> @@ -2407,7 +2428,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> arm_smmu_write_ctx_desc(smmu, &smmu_domain->s1_cfg);
>
> arm_smmu_install_ste_for_dev(master);
> +
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_add(&master->domain_head, &smmu_domain->devices);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> arm_smmu_enable_ats(master);
> +
> out_unlock:
> mutex_unlock(&smmu_domain->init_mutex);
> return ret;
>
* [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (6 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Avoid locking on invalidation path when not using ATS Will Deacon
@ 2019-08-21 15:17 ` Will Deacon
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
8 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2019-08-21 15:17 UTC (permalink / raw)
To: iommu; +Cc: Jean-Philippe Brucker, Will Deacon, Robin Murphy
This reverts commit dc65dc6eaf3eb762f202bb3493492d372b662b3d.
Now that ATC invalidation is performed in the correct places and without
incurring a locking overhead for non-ATS systems, we can re-enable the
corresponding SMMU feature detection.
Signed-off-by: Will Deacon <will@kernel.org>
---
drivers/iommu/arm-smmu-v3.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 0e43529d55fe..b8049ea2e455 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3336,13 +3336,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
/* Boolean feature flags */
-#if 0 /* ATS invalidation is slow and broken */
if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
-#endif
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
--
2.11.0
* arm-smmu-v3 high cpu usage for NVMe
2019-08-21 15:17 [PATCH v2 0/8] Sort out SMMUv3 ATC invalidation and locking Will Deacon
` (7 preceding siblings ...)
2019-08-21 15:17 ` [PATCH v2 8/8] Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI" Will Deacon
@ 2020-01-02 17:44 ` John Garry
2020-03-18 20:53 ` Will Deacon
8 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-01-02 17:44 UTC (permalink / raw)
To: Will Deacon, iommu, Robin Murphy
Cc: Jean-Philippe Brucker, Ming Lei, Marc Zyngier
Hi Will, Robin,
While analyzing an arm64 issue in interrupt handling for NVMe [0], we
have noticed a worryingly high CPU utilization in the SMMU driver.
The background is that we may get a CPU lockup for high-throughput NVMe
testing, and we noticed that disabling the SMMU during testing avoids
the issue. However, this lockup is a cross-architecture issue and there
are attempts to address it, like [1]. To me, disabling the SMMU is just
avoiding that specific issue.
Anyway, we should still consider this high CPU loading:
PerfTop: 1694 irqs/sec kernel:97.3% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, CPU: 0)
--------------------------------------------------------------------------------------------------------------------------
50.84% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
19.51% [kernel] [k] _raw_spin_unlock_irqrestore
5.14% [kernel] [k] __slab_free
2.37% [kernel] [k] bio_release_pages.part.42
2.20% [kernel] [k] fput_many
1.92% [kernel] [k] aio_complete_rw
1.85% [kernel] [k] __arm_lpae_unmap
1.71% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
1.11% [kernel] [k] sbitmap_queue_clear
1.05% [kernel] [k] blk_mq_free_request
0.97% [kernel] [k] nvme_irq
0.71% [kernel] [k] blk_account_io_done
0.66% [kernel] [k] kmem_cache_free
0.66% [kernel] [k] blk_mq_complete_request
This is for a CPU servicing the NVMe interrupt and doing the DMA unmap.
The DMA unmap is done in threaded interrupt context.
And for the overall system, we have:
PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--------------------------------------------------------------------------------------------------------------------------
27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
11.71% [kernel] [k] _raw_spin_unlock_irqrestore
6.35% [kernel] [k] _raw_spin_unlock_irq
2.65% [kernel] [k] get_user_pages_fast
2.03% [kernel] [k] __slab_free
1.55% [kernel] [k] tick_nohz_idle_exit
1.47% [kernel] [k] arm_lpae_map
1.39% [kernel] [k] __fget
1.14% [kernel] [k] __lock_text_start
1.09% [kernel] [k] _raw_spin_lock
1.08% [kernel] [k] bio_release_pages.part.42
1.03% [kernel] [k] __sbitmap_get_word
0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
0.91% [kernel] [k] fput_many
0.88% [kernel] [k] __arm_lpae_map
One thing to note is that we still spend an appreciable amount of time
in arm_smmu_atc_inv_domain(), which is disappointing when considering it
should effectively be a noop.
As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing
our batch size is 1, so we're not seeing the real benefit of the
batching. I can't help but think that we could improve this code to try
to combine CMD_SYNCs for small batches.
Anyway, let me know your thoughts or any questions. I'll have a look,
if I get a chance, for other possible bottlenecks.
[0] https://lore.kernel.org/lkml/e815b5451ea86e99d42045f7067f455a@www.loen.fr/
[1] https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
Cheers,
John
On 21/08/2019 16:17, Will Deacon wrote:
> Hi again,
>
> This is version two of the patches I posted yesterday:
>
> v1: https://lkml.kernel.org/r/20190820154549.17018-1-will@kernel.org
>
> Changes since then include:
>
> * Fix 'ats_enabled' checking when enabling ATS
> * Remove redundant 'dev_is_pci()' calls
> * Remove bool bitfield
> * Add patch temporarily disabling ATS detection for -stable
> * Issue ATC invalidation even when non-leaf
> * Elide invalidation/SYNC for zero-sized address ranges
> * Shuffle the patches round a bit
>
> Thanks,
>
> Will
>
> Cc: Zhen Lei <thunder.leizhen@huawei.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
>
> --->8
>
> Will Deacon (8):
> iommu/arm-smmu-v3: Document ordering guarantees of command insertion
> iommu/arm-smmu-v3: Disable detection of ATS and PRI
> iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
> iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
> iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
> iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
> iommu/arm-smmu-v3: Avoid locking on invalidation path when not using
> ATS
> Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"
>
> drivers/iommu/arm-smmu-v3.c | 117 ++++++++++++++++++++++++++++++++------------
> 1 file changed, 87 insertions(+), 30 deletions(-)
>
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-01-02 17:44 ` arm-smmu-v3 high cpu usage for NVMe John Garry
@ 2020-03-18 20:53 ` Will Deacon
2020-03-19 12:54 ` John Garry
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
0 siblings, 2 replies; 32+ messages in thread
From: Will Deacon @ 2020-03-18 20:53 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
Hi John,
On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> And for the overall system, we have:
>
> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> 0/40116 [4000Hz cycles], (all, 96 CPUs)
> --------------------------------------------------------------------------------------------------------------------------
>
> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> 6.35% [kernel] [k] _raw_spin_unlock_irq
> 2.65% [kernel] [k] get_user_pages_fast
> 2.03% [kernel] [k] __slab_free
> 1.55% [kernel] [k] tick_nohz_idle_exit
> 1.47% [kernel] [k] arm_lpae_map
> 1.39% [kernel] [k] __fget
> 1.14% [kernel] [k] __lock_text_start
> 1.09% [kernel] [k] _raw_spin_lock
> 1.08% [kernel] [k] bio_release_pages.part.42
> 1.03% [kernel] [k] __sbitmap_get_word
> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> 0.91% [kernel] [k] fput_many
> 0.88% [kernel] [k] __arm_lpae_map
>
> One thing to note is that we still spend an appreciable amount of time in
> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> effectively be a noop.
>
> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> batch size is 1, so we're not seeing the real benefit of the batching. I
> can't help but think that we could improve this code to try to combine CMD
> SYNCs for small batches.
>
> Anyway, let me know your thoughts or any questions. I'll have a look if a
> get a chance for other possible bottlenecks.
Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.
Will
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-18 20:53 ` Will Deacon
@ 2020-03-19 12:54 ` John Garry
2020-03-19 18:43 ` Jean-Philippe Brucker
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-19 12:54 UTC (permalink / raw)
To: Will Deacon
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
Hi Will,
>
> On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
>> And for the overall system, we have:
>>
>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>> --------------------------------------------------------------------------------------------------------------------------
>>
>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>> 2.65% [kernel] [k] get_user_pages_fast
>> 2.03% [kernel] [k] __slab_free
>> 1.55% [kernel] [k] tick_nohz_idle_exit
>> 1.47% [kernel] [k] arm_lpae_map
>> 1.39% [kernel] [k] __fget
>> 1.14% [kernel] [k] __lock_text_start
>> 1.09% [kernel] [k] _raw_spin_lock
>> 1.08% [kernel] [k] bio_release_pages.part.42
>> 1.03% [kernel] [k] __sbitmap_get_word
>> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
>> 0.91% [kernel] [k] fput_many
>> 0.88% [kernel] [k] __arm_lpae_map
>>
>> One thing to note is that we still spend an appreciable amount of time in
>> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
>> effectively be a noop.
>>
>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
>> batch size is 1, so we're not seeing the real benefit of the batching. I
>> can't help but think that we could improve this code to try to combine CMD
>> SYNCs for small batches.
>>
>> Anyway, let me know your thoughts or any questions. I'll have a look if a
>> get a chance for other possible bottlenecks.
>
> Did you ever get any more information on this? I don't have any SMMUv3
> hardware any more, so I can't really dig into this myself.
I'm only getting back to look at this now, as SMMU performance is a bit
of a hot topic again for us.
So one thing we are doing which looks to help performance is this series
from Marc:
https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
So that is just spreading the per-CPU load for NVMe interrupt handling
(where the DMA unmapping is happening), so I'd say just side-stepping
any SMMU issue really.
Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll
look at this again now.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-19 12:54 ` John Garry
@ 2020-03-19 18:43 ` Jean-Philippe Brucker
2020-03-20 10:41 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-19 18:43 UTC (permalink / raw)
To: John Garry; +Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> Hi Will,
>
> >
> > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > And for the overall system, we have:
> > >
> > > PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> > > 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > --------------------------------------------------------------------------------------------------------------------------
> > >
> > > 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> > > 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> > > 6.35% [kernel] [k] _raw_spin_unlock_irq
> > > 2.65% [kernel] [k] get_user_pages_fast
> > > 2.03% [kernel] [k] __slab_free
> > > 1.55% [kernel] [k] tick_nohz_idle_exit
> > > 1.47% [kernel] [k] arm_lpae_map
> > > 1.39% [kernel] [k] __fget
> > > 1.14% [kernel] [k] __lock_text_start
> > > 1.09% [kernel] [k] _raw_spin_lock
> > > 1.08% [kernel] [k] bio_release_pages.part.42
> > > 1.03% [kernel] [k] __sbitmap_get_word
> > > 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> > > 0.91% [kernel] [k] fput_many
> > > 0.88% [kernel] [k] __arm_lpae_map
> > >
> > > One thing to note is that we still spend an appreciable amount of time in
> > > arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> > > effectively be a noop.
> > >
> > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> > > batch size is 1, so we're not seeing the real benefit of the batching. I
> > > can't help but think that we could improve this code to try to combine CMD
> > > SYNCs for small batches.
> > >
> > > Anyway, let me know your thoughts or any questions. I'll have a look if a
> > > get a chance for other possible bottlenecks.
> >
> > Did you ever get any more information on this? I don't have any SMMUv3
> > hardware any more, so I can't really dig into this myself.
>
> I'm only getting back to look at this now, as SMMU performance is a bit of a
> hot topic again for us.
>
> So one thing we are doing which looks to help performance is this series
> from Marc:
>
> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
>
> So that is just spreading the per-CPU load for NVMe interrupt handling
> (where the DMA unmapping is happening), so I'd say just side-stepping any
> SMMU issue really.
>
> Going back to the SMMU, I wanted to run epbf and perf annotate to help
> profile this, but was having no luck getting them to work properly. I'll
> look at this again now.
Could you also try the upcoming ATS changes currently in Will's tree?
They won't improve your numbers, but it'd be good to check that they
don't make things worse.
I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.
- 6.07% arm_smmu_iotlb_sync
- 5.74% arm_smmu_tlb_inv_range
5.09% arm_smmu_cmdq_issue_cmdlist
0.28% __pi_memset
0.08% __pi_memcpy
0.08% arm_smmu_atc_inv_domain.constprop.37
0.07% arm_smmu_cmdq_build_cmd
0.01% arm_smmu_cmdq_batch_add
0.31% __pi_memset
So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.
Thanks,
Jean
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-19 18:43 ` Jean-Philippe Brucker
@ 2020-03-20 10:41 ` John Garry
2020-03-20 11:18 ` Jean-Philippe Brucker
2020-05-22 14:52 ` John Garry
0 siblings, 2 replies; 32+ messages in thread
From: John Garry @ 2020-03-20 10:41 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
>> Hi Will,
>>
>>>
>>> On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
>>>> And for the overall system, we have:
>>>>
>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>>>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>>>> 2.65% [kernel] [k] get_user_pages_fast
>>>> 2.03% [kernel] [k] __slab_free
>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
>>>> 1.47% [kernel] [k] arm_lpae_map
>>>> 1.39% [kernel] [k] __fget
>>>> 1.14% [kernel] [k] __lock_text_start
>>>> 1.09% [kernel] [k] _raw_spin_lock
>>>> 1.08% [kernel] [k] bio_release_pages.part.42
>>>> 1.03% [kernel] [k] __sbitmap_get_word
>>>> 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
>>>> 0.91% [kernel] [k] fput_many
>>>> 0.88% [kernel] [k] __arm_lpae_map
>>>>
>>>> One thing to note is that we still spend an appreciable amount of time in
>>>> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
>>>> effectively be a noop.
>>>>
>>>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
>>>> batch size is 1, so we're not seeing the real benefit of the batching. I
>>>> can't help but think that we could improve this code to try to combine CMD
>>>> SYNCs for small batches.
>>>>
>>>> Anyway, let me know your thoughts or any questions. I'll have a look if a
>>>> get a chance for other possible bottlenecks.
>>>
>>> Did you ever get any more information on this? I don't have any SMMUv3
>>> hardware any more, so I can't really dig into this myself.
>>
>> I'm only getting back to look at this now, as SMMU performance is a bit of a
>> hot topic again for us.
>>
>> So one thing we are doing which looks to help performance is this series
>> from Marc:
>>
>> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
>>
>> So that is just spreading the per-CPU load for NVMe interrupt handling
>> (where the DMA unmapping is happening), so I'd say just side-stepping any
>> SMMU issue really.
>>
>> Going back to the SMMU, I wanted to run epbf and perf annotate to help
>> profile this, but was having no luck getting them to work properly. I'll
>> look at this again now.
>
> Could you also try with the upcoming ATS change currently in Will's tree?
> They won't improve your numbers but it'd be good to check that they don't
> make things worse.
I can do when I get a chance.
>
> I've run a bunch of netperf instances on multiple cores and collecting
> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> consistently.
>
> - 6.07% arm_smmu_iotlb_sync
> - 5.74% arm_smmu_tlb_inv_range
> 5.09% arm_smmu_cmdq_issue_cmdlist
> 0.28% __pi_memset
> 0.08% __pi_memcpy
> 0.08% arm_smmu_atc_inv_domain.constprop.37
> 0.07% arm_smmu_cmdq_build_cmd
> 0.01% arm_smmu_cmdq_batch_add
> 0.31% __pi_memset
>
> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
> when ATS is not used. According to the annotations, the load from the
> atomic_read(), that checks whether the domain uses ATS, is 77% of the
> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
> there is much room for optimization there.
Well, I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value.
But that would be an optimisation for ATS also, and there were no ATS
devices at the time (to verify performance).
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 10:41 ` John Garry
@ 2020-03-20 11:18 ` Jean-Philippe Brucker
2020-03-20 16:20 ` John Garry
2020-05-22 14:52 ` John Garry
1 sibling, 1 reply; 32+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-20 11:18 UTC (permalink / raw)
To: John Garry; +Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
On Fri, Mar 20, 2020 at 10:41:44AM +0000, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> > > Hi Will,
> > >
> > > >
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > >
> > > > > PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop:
> > > > > 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > > > --------------------------------------------------------------------------------------------------------------------------
> > > > >
> > > > > 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> > > > > 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
> > > > > 6.35% [kernel] [k] _raw_spin_unlock_irq
> > > > > 2.65% [kernel] [k] get_user_pages_fast
> > > > > 2.03% [kernel] [k] __slab_free
> > > > > 1.55% [kernel] [k] tick_nohz_idle_exit
> > > > > 1.47% [kernel] [k] arm_lpae_map
> > > > > 1.39% [kernel] [k] __fget
> > > > > 1.14% [kernel] [k] __lock_text_start
> > > > > 1.09% [kernel] [k] _raw_spin_lock
> > > > > 1.08% [kernel] [k] bio_release_pages.part.42
> > > > > 1.03% [kernel] [k] __sbitmap_get_word
> > > > > 0.97% [kernel] [k] arm_smmu_atc_inv_domain.constprop.42
> > > > > 0.91% [kernel] [k] fput_many
> > > > > 0.88% [kernel] [k] __arm_lpae_map
> > > > >
> > > > > One thing to note is that we still spend an appreciable amount of time in
> > > > > arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> > > > > effectively be a noop.
> > > > >
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> > > > > batch size is 1, so we're not seeing the real benefit of the batching. I
> > > > > can't help but think that we could improve this code to try to combine CMD
> > > > > SYNCs for small batches.
> > > > >
> > > > > Anyway, let me know your thoughts or any questions. I'll have a look if a
> > > > > get a chance for other possible bottlenecks.
> > > >
> > > > Did you ever get any more information on this? I don't have any SMMUv3
> > > > hardware any more, so I can't really dig into this myself.
> > >
> > > I'm only getting back to look at this now, as SMMU performance is a bit of a
> > > hot topic again for us.
> > >
> > > So one thing we are doing which looks to help performance is this series
> > > from Marc:
> > >
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fce52@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > >
> > > So that is just spreading the per-CPU load for NVMe interrupt handling
> > > (where the DMA unmapping is happening), so I'd say just side-stepping any
> > > SMMU issue really.
> > >
> > > Going back to the SMMU, I wanted to run epbf and perf annotate to help
> > > profile this, but was having no luck getting them to work properly. I'll
> > > look at this again now.
> >
> > Could you also try with the upcoming ATS change currently in Will's tree?
> > They won't improve your numbers but it'd be good to check that they don't
> > make things worse.
>
> I can do when I get a chance.
>
> >
> > I've run a bunch of netperf instances on multiple cores and collecting
> > SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> > consistently.
> >
> > - 6.07% arm_smmu_iotlb_sync
> > - 5.74% arm_smmu_tlb_inv_range
> > 5.09% arm_smmu_cmdq_issue_cmdlist
> > 0.28% __pi_memset
> > 0.08% __pi_memcpy
> > 0.08% arm_smmu_atc_inv_domain.constprop.37
> > 0.07% arm_smmu_cmdq_build_cmd
> > 0.01% arm_smmu_cmdq_batch_add
> > 0.31% __pi_memset
> >
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
> > when ATS is not used. According to the annotations, the load from the
> > atomic_read(), that checks whether the domain uses ATS, is 77% of the
> > samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
> > there is much room for optimization there.
>
> Well I did originally suggest using RCU protection to scan the list of
> devices, instead of reading an atomic and checking for non-zero value. But
> that would be an optimsation for ATS also, and there was no ATS devices at
> the time (to verify performance).
Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.
When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.
Thanks,
Jean
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 11:18 ` Jean-Philippe Brucker
@ 2020-03-20 16:20 ` John Garry
2020-03-20 16:33 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-20 16:20 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Will Deacon, Ming Lei, iommu, Marc Zyngier, Robin Murphy
>>
>>>
>>> I've run a bunch of netperf instances on multiple cores and collecting
>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>> consistently.
>>>
>>> - 6.07% arm_smmu_iotlb_sync
>>> - 5.74% arm_smmu_tlb_inv_range
>>> 5.09% arm_smmu_cmdq_issue_cmdlist
>>> 0.28% __pi_memset
>>> 0.08% __pi_memcpy
>>> 0.08% arm_smmu_atc_inv_domain.constprop.37
>>> 0.07% arm_smmu_cmdq_build_cmd
>>> 0.01% arm_smmu_cmdq_batch_add
>>> 0.31% __pi_memset
>>>
>>> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
>>> when ATS is not used. According to the annotations, the load from the
>>> atomic_read(), that checks whether the domain uses ATS, is 77% of the
>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
>>> there is much room for optimization there.
>>
>> Well I did originally suggest using RCU protection to scan the list of
>> devices, instead of reading an atomic and checking for non-zero value. But
>> that would be an optimsation for ATS also, and there was no ATS devices at
>> the time (to verify performance).
>
> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
> performance, but I agree that using RCU to scan the list should get better
> results when using ATS.
>
> When ATS isn't in use however, I suspect reading nr_ats_masters should be
> more efficient than taking the RCU lock + reading an "ats_devices" list
> (since the smmu_domain->devices list also serves context descriptor
> invalidation, even when ATS isn't in use). I'll run some tests however, to
> see if I can micro-optimize this case, but I don't expect noticeable
> improvements.
ok, cheers. I, too, would not expect a significant improvement there.
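As a sketch of the check being discussed, a per-domain atomic counter of ATS-enabled masters gives a single-load fast path for the invalidation code. The struct and function names below only mirror the driver's; this is an illustrative userspace model, not the actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch (not the driver code): the domain keeps a count of
 * masters with ATS enabled, and the invalidation path checks it with a
 * single atomic load before walking any device list. */
struct smmu_domain {
	atomic_int nr_ats_masters;
};

static bool domain_needs_atc_inv(struct smmu_domain *d)
{
	/* fast path: one load; no RCU lock, no list walk */
	return atomic_load_explicit(&d->nr_ats_masters,
				    memory_order_relaxed) != 0;
}

static void domain_ats_enable(struct smmu_domain *d)
{
	atomic_fetch_add_explicit(&d->nr_ats_masters, 1,
				  memory_order_relaxed);
}

static void domain_ats_disable(struct smmu_domain *d)
{
	atomic_fetch_sub_explicit(&d->nr_ats_masters, 1,
				  memory_order_relaxed);
}
```

The cost is that enable/disable bounce the counter's cacheline, which is why an RCU-protected list was floated as an alternative for the ATS-enabled case.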
JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane, if
not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
 9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
 1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
 1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
 1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
 1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw
But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:
      : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
      : if (sync) {
 0.00 : ffff80001071c948: ldr w0, [x29, #108]
      : int ret = 0;
 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
      : if (sync) {
 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990 <arm_smmu_cmdq_issue_cmdlist+0x420>
      : arch_local_irq_restore():
 0.00 : ffff80001071c954: msr daif, x21
      : arm_smmu_cmdq_issue_cmdlist():
      : }
      : }
      :
      : local_irq_restore(flags);
      : return ret;
      : }
99.51 : ffff80001071c958: adrp x0, ffff800011909000 <page_wait_table+0x14c0>
 0.00 : ffff80001071c95c: add x21, x0, #0x908
 0.02 : ffff80001071c960: ldr x2, [x29, #488]
 0.14 : ffff80001071c964: ldr x1, [x21]
 0.00 : ffff80001071c968: eor x1, x2, x1
 0.00 : ffff80001071c96c: mov w0, w24
But there may be a hint that we're spending a lot of time polling for
CMD_SYNC completion.
The files are available here:
https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt,
and .../report
Or maybe I'm just not using the tool properly ...
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 16:20 ` John Garry
@ 2020-03-20 16:33 ` Marc Zyngier
2020-03-23 9:03 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-20 16:33 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
Hi John,
On 2020-03-20 16:20, John Garry wrote:
>>>
>>>>
>>>> I've run a bunch of netperf instances on multiple cores and
>>>> collecting
>>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>>> consistently.
>>>>
>>>> - 6.07% arm_smmu_iotlb_sync
>>>> - 5.74% arm_smmu_tlb_inv_range
>>>> 5.09% arm_smmu_cmdq_issue_cmdlist
>>>> 0.28% __pi_memset
>>>> 0.08% __pi_memcpy
>>>> 0.08% arm_smmu_atc_inv_domain.constprop.37
>>>> 0.07% arm_smmu_cmdq_build_cmd
>>>> 0.01% arm_smmu_cmdq_batch_add
>>>> 0.31% __pi_memset
>>>>
>>>> So arm_smmu_atc_inv_domain() takes about 1.4% of
>>>> arm_smmu_iotlb_sync(),
>>>> when ATS is not used. According to the annotations, the load from
>>>> the
>>>> atomic_read(), that checks whether the domain uses ATS, is 77% of
>>>> the
>>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm
>>>> not sure
>>>> there is much room for optimization there.
>>>
>>> Well I did originally suggest using RCU protection to scan the list
>>> of
>>> devices, instead of reading an atomic and checking for non-zero
>>> value. But
>>> that would be an optimisation for ATS also, and there were no ATS
>>> devices at
>>> the time (to verify performance).
>>
>> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
>> performance, but I agree that using RCU to scan the list should get
>> better
>> results when using ATS.
>>
>> When ATS isn't in use however, I suspect reading nr_ats_masters should
>> be
>> more efficient than taking the RCU lock + reading an "ats_devices"
>> list
>> (since the smmu_domain->devices list also serves context descriptor
>> invalidation, even when ATS isn't in use). I'll run some tests
>> however, to
>> see if I can micro-optimize this case, but I don't expect noticeable
>> improvements.
>
> ok, cheers. I, too, would not expect a significant improvement there.
>
> JFYI, I've been playing with "perf annotate" today and it's giving
> strange results for my NVMe testing. So "report" looks somewhat sane,
> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>
>
> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
> arm_smmu_cmdq_issue_cmdlist
> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
> _raw_spin_unlock_irqrestore
> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
> arm_smmu_atc_inv_domain.constprop.42
> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>
> But "annotate" consistently tells me that a specific instruction
> consumes ~99% of the load for the enqueue function:
>
> : /* 5. If we are inserting a CMD_SYNC,
> we must wait for it to complete */
> : if (sync) {
> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
> : int ret = 0;
> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
> : if (sync) {
> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
> <arm_smmu_cmdq_issue_cmdlist+0x420>
> : arch_local_irq_restore():
> 0.00 : ffff80001071c954: msr daif, x21
> : arm_smmu_cmdq_issue_cmdlist():
> : }
> : }
> :
> : local_irq_restore(flags);
> : return ret;
> : }
> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
> <page_wait_table+0x14c0>
This is likely the side effect of the re-enabling of interrupts (msr daif, x21)
on the previous instruction, which causes the perf interrupt to fire
right after.
Time to enable pseudo-NMIs in the PMUv3 driver...
M.
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 16:33 ` Marc Zyngier
@ 2020-03-23 9:03 ` John Garry
2020-03-23 9:16 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-23 9:03 UTC (permalink / raw)
To: Marc Zyngier
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
On 20/03/2020 16:33, Marc Zyngier wrote:
>> JFYI, I've been playing with "perf annotate" today and it's giving
>> strange results for my NVMe testing. So "report" looks somewhat sane,
>> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>>
>>
>> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> arm_smmu_cmdq_issue_cmdlist
>> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> _raw_spin_unlock_irqrestore
>> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
>> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
>> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
>> arm_smmu_atc_inv_domain.constprop.42
>> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
>> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>>
>> But "annotate" consistently tells me that a specific instruction
>> consumes ~99% of the load for the enqueue function:
>>
>> : /* 5. If we are inserting a CMD_SYNC,
>> we must wait for it to complete */
>> : if (sync) {
>> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
>> : int ret = 0;
>> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
>> : if (sync) {
>> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
>> <arm_smmu_cmdq_issue_cmdlist+0x420>
>> : arch_local_irq_restore():
>> 0.00 : ffff80001071c954: msr daif, x21
>> : arm_smmu_cmdq_issue_cmdlist():
>> : }
>> : }
>> :
>> : local_irq_restore(flags);
>> : return ret;
>> : }
>> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
>> <page_wait_table+0x14c0>
>
Hi Marc,
> This is likely the side effect of the re-enabling of interrupts (msr
> daif, x21)
> on the previous instruction which causes the perf interrupt to fire
> right after.
ok, makes sense.
>
> Time to enable pseudo-NMIs in the PMUv3 driver...
>
Do you know if there is any plan for this?
In the meantime, maybe I can do some trickery by putting the
local_irq_restore() in a separate function, outside
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-23 9:03 ` John Garry
@ 2020-03-23 9:16 ` Marc Zyngier
2020-03-24 9:18 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-23 9:16 UTC (permalink / raw)
To: John Garry
Cc: Jean-Philippe Brucker, Robin Murphy, Ming Lei, iommu, Will Deacon
On 2020-03-23 09:03, John Garry wrote:
> On 20/03/2020 16:33, Marc Zyngier wrote:
>>> JFYI, I've been playing with "perf annotate" today and it's giving
>>> strange results for my NVMe testing. So "report" looks somewhat sane,
>>> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>>>
>>>
>>> 55.39% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> arm_smmu_cmdq_issue_cmdlist
>>> 9.74% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> _raw_spin_unlock_irqrestore
>>> 2.02% irq/342-nvme0q1 [kernel.kallsyms] [k] nvme_irq
>>> 1.86% irq/342-nvme0q1 [kernel.kallsyms] [k] fput_many
>>> 1.73% irq/342-nvme0q1 [kernel.kallsyms] [k]
>>> arm_smmu_atc_inv_domain.constprop.42
>>> 1.67% irq/342-nvme0q1 [kernel.kallsyms] [k] __arm_lpae_unmap
>>> 1.49% irq/342-nvme0q1 [kernel.kallsyms] [k] aio_complete_rw
>>>
>>> But "annotate" consistently tells me that a specific instruction
>>> consumes ~99% of the load for the enqueue function:
>>>
>>> : /* 5. If we are inserting a CMD_SYNC,
>>> we must wait for it to complete */
>>> : if (sync) {
>>> 0.00 : ffff80001071c948: ldr w0, [x29, #108]
>>> : int ret = 0;
>>> 0.00 : ffff80001071c94c: mov w24, #0x0 // #0
>>> : if (sync) {
>>> 0.00 : ffff80001071c950: cbnz w0, ffff80001071c990
>>> <arm_smmu_cmdq_issue_cmdlist+0x420>
>>> : arch_local_irq_restore():
>>> 0.00 : ffff80001071c954: msr daif, x21
>>> : arm_smmu_cmdq_issue_cmdlist():
>>> : }
>>> : }
>>> :
>>> : local_irq_restore(flags);
>>> : return ret;
>>> : }
>>> 99.51 : ffff80001071c958: adrp x0, ffff800011909000
>>> <page_wait_table+0x14c0>
>>
>
> Hi Marc,
>
>> This is likely the side effect of the re-enabling of interrupts (msr
>> daif, x21)
>> on the previous instruction which causes the perf interrupt to fire
>> right after.
>
> ok, makes sense.
>
>>
>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>
>
> Do you know if there is any plan for this?
There was. Julien Thierry has a bunch of patches for that [1], but they
need reviving.
>
> In the meantime, maybe I can do some trickery by putting the
> local_irq_restore() in a separate function, outside
> arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
> function.
I don't see how you can improve the profiling without compromising
the locking in this case...
Thanks,
M.
[1] https://patchwork.kernel.org/cover/11047407/
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-23 9:16 ` Marc Zyngier
@ 2020-03-24 9:18 ` John Garry
2020-03-24 10:43 ` Marc Zyngier
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 9:18 UTC (permalink / raw)
To: Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On 23/03/2020 09:16, Marc Zyngier wrote:
+ Julien, Mark
Hi Marc,
>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>
>>
>> Do you know if there is any plan for this?
>
> There was. Julien Thierry has a bunch of patches for that [1], but they
> need reviving.
>
So those patches still apply cleanly (apart from the kvm patch, which I
can skip, I suppose) and build, so I can try this I figure. Is there
anything else which I should ensure or know about? Apart from enabling
CONFIG_ARM64_PSEUDO_NMI.
A quickly taken perf annotate and report is at the tip here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling
>>
>> In the meantime, maybe I can do some trickery by putting the
>> local_irq_restore() in a separate function, outside
>> arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
>> function.
>
Scratch that :)
> I don't see how you can improve the profiling without compromising
> the locking in this case...
>
Cheers,
John
[1] https://patchwork.kernel.org/cover/11047407/
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 9:18 ` John Garry
@ 2020-03-24 10:43 ` Marc Zyngier
2020-03-24 11:55 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Marc Zyngier @ 2020-03-24 10:43 UTC (permalink / raw)
To: John Garry
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On Tue, 24 Mar 2020 09:18:10 +0000
John Garry <john.garry@huawei.com> wrote:
> On 23/03/2020 09:16, Marc Zyngier wrote:
>
> + Julien, Mark
>
> Hi Marc,
>
> >>> Time to enable pseudo-NMIs in the PMUv3 driver...
> >>>
> >>
> >> Do you know if there is any plan for this?
> >
> > There was. Julien Thierry has a bunch of patches for that [1], but they
> > need reviving.
> >
>
> So those patches still apply cleanly (apart from the kvm patch, which
> I can skip, I suppose) and build, so I can try this I figure. Is
> there anything else which I should ensure or know about? Apart from
> enabling CONFIG_ARM64_PSEUDO_NMI.
You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge... ;-).
M.
--
Jazz is not dead. It just smells funny...
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 10:43 ` Marc Zyngier
@ 2020-03-24 11:55 ` John Garry
2020-03-24 12:07 ` Robin Murphy
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 11:55 UTC (permalink / raw)
To: Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Robin Murphy, Ming Lei,
iommu, Will Deacon, Julien Thierry
On 24/03/2020 10:43, Marc Zyngier wrote:
> On Tue, 24 Mar 2020 09:18:10 +0000
> John Garry<john.garry@huawei.com> wrote:
>
>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>
>> + Julien, Mark
>>
>> Hi Marc,
>>
>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>
>>>> Do you know if there is any plan for this?
>>> There was. Julien Thierry has a bunch of patches for that [1], but they
>>> need reviving.
>>>
>> So those patches still apply cleanly (apart from the kvm patch, which
>> I can skip, I suppose) and build, so I can try this I figure. Is
>> there anything else which I should ensure or know about? Apart from
>> enabling CONFIG_ARM64_PSEUDO_NMI.
> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
> has it set to 0, preventing me from being able to use the feature
> (hint, nudge...;-).
Yeah, apparently it's set on our D06CS board, but I just need to double
check the FW version with our FW guy.
As for D05, there has not been a FW update there in quite a long time
and no plans for it. Sorry.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 11:55 ` John Garry
@ 2020-03-24 12:07 ` Robin Murphy
2020-03-24 12:37 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: Robin Murphy @ 2020-03-24 12:07 UTC (permalink / raw)
To: John Garry, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
On 2020-03-24 11:55 am, John Garry wrote:
> On 24/03/2020 10:43, Marc Zyngier wrote:
>> On Tue, 24 Mar 2020 09:18:10 +0000
>> John Garry<john.garry@huawei.com> wrote:
>>
>>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>>
>>> + Julien, Mark
>>>
>>> Hi Marc,
>>>
>>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>>
>>>>> Do you know if there is any plan for this?
>>>> There was. Julien Thierry has a bunch of patches for that [1], but
>>>> they need reviving.
>>>>
>>> So those patches still apply cleanly (apart from the kvm patch, which
>>> I can skip, I suppose) and build, so I can try this I figure. Is
>>> there anything else which I should ensure or know about? Apart from
>>> enabling CONFIG_ARM64_PSEUDO_NMI.
>> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
>> has it set to 0, preventing me from being able to use the feature
>> (hint, nudge...;-).
>
> Yeah, apparently it's set on our D06CS board, but I just need to double
> check the FW version with our FW guy.
Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but
since it's not been called out:
This high priority configuration for interrupts needs to be
explicitly enabled by setting the kernel parameter
"irqchip.gicv3_pseudo_nmi" to 1.
FWIW I believe it is still on the plan for someone here to dust off the
PMU pNMI patches at some point.
Robin.
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 12:07 ` Robin Murphy
@ 2020-03-24 12:37 ` John Garry
2020-03-25 15:31 ` John Garry
0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2020-03-24 12:37 UTC (permalink / raw)
To: Robin Murphy, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
On 24/03/2020 12:07, Robin Murphy wrote:
> On 2020-03-24 11:55 am, John Garry wrote:
>> On 24/03/2020 10:43, Marc Zyngier wrote:
>>> On Tue, 24 Mar 2020 09:18:10 +0000
>>> John Garry<john.garry@huawei.com> wrote:
>>>
>>>> On 23/03/2020 09:16, Marc Zyngier wrote:
>>>>
>>>> + Julien, Mark
>>>>
>>>> Hi Marc,
>>>>
>>>>>>> Time to enable pseudo-NMIs in the PMUv3 driver...
>>>>>>>
>>>>>> Do you know if there is any plan for this?
>>>>> There was. Julien Thierry has a bunch of patches for that [1], but
>>>>> they need reviving.
>>>>>
>>>> So those patches still apply cleanly (apart from the kvm patch, which
>>>> I can skip, I suppose) and build, so I can try this I figure. Is
>>>> there anything else which I should ensure or know about? Apart from
>>>> enabling CONFIG_ARM64_PSEUDO_NMI.
>>> You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
>>> has it set to 0, preventing me from being able to use the feature
>>> (hint, nudge...;-).
>>
>> Yeah, apparently it's set on our D06CS board, but I just need to
>> double check the FW version with our FW guy.
>
> Hopefully you saw the help for CONFIG_ARM64_PSUEDO_NMI already, but
> since it's not been called out:
>
> This high priority configuration for interrupts needs to be
> explicitly enabled by setting the kernel parameter
> "irqchip.gicv3_pseudo_nmi" to 1.
Yeah, I saw that by chance somewhere else previously.
>
> FWIW I believe it is still on the plan for someone here to dust off the
> PMU pNMI patches at some point.
Cool. Well I can try to experiment with what Julien had at v4 for now.
Cheers,
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-24 12:37 ` John Garry
@ 2020-03-25 15:31 ` John Garry
0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2020-03-25 15:31 UTC (permalink / raw)
To: Robin Murphy, Marc Zyngier
Cc: Mark Rutland, Jean-Philippe Brucker, Ming Lei, iommu,
Will Deacon, Julien Thierry
>
>>
>> FWIW I believe it is still on the plan for someone here to dust off
>> the PMU pNMI patches at some point.
>
> Cool. Well I can try to experiment with what Julien had at v4 for now.
>
JFYI, I have done some more perf record capturing, and updated the
"annotate" and "report" output here
https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt
and .../report
This capture is just for cpu0, since NVMe irq handling+dma unmapping
will occur on specific CPUs, cpu0 being one of them.
The reports look somewhat sane. So we no longer have ~99% of time
attributed to re-enabling interrupts, now that's like:
 3.14 : ffff80001071eae0: ldr w0, [x29, #108]
      : int ret = 0;
 0.00 : ffff80001071eae4: mov w24, #0x0 // #0
      : if (sync) {
 0.00 : ffff80001071eae8: cbnz w0, ffff80001071eb44 <arm_smmu_cmdq_issue_cmdlist+0x44c>
      : arch_local_irq_restore():
      : asm volatile(ALTERNATIVE(
 0.00 : ffff80001071eaec: msr daif, x21
      : arch_static_branch():
 0.25 : ffff80001071eaf0: nop
      : arm_smmu_cmdq_issue_cmdlist():
      : }
      : }
      :
      : local_irq_restore(flags);
      : return ret;
      : }
One observation (if these reports are to be believed) is that we may
spend a lot of time in the CAS loop, trying to get a place in the queue
initially:
: __CMPXCHG_CASE(w, , , 32, )
: __CMPXCHG_CASE(x, , , 64, )
0.00 : ffff80001071e828: mov x0, x27
0.00 : ffff80001071e82c: mov x4, x1
0.00 : ffff80001071e830: cas x4, x2, [x27]
28.61 : ffff80001071e834: mov x0, x4
: arm_smmu_cmdq_issue_cmdlist():
: if (old == llq.val)
0.00 : ffff80001071e838: ldr x1, [x23]
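For reference, the CAS loop above amounts to retrying a compare-and-swap on a packed prod/cons word until the CPU claims its slots. The following userspace model (the field layout and names are my assumptions, not the driver's) shows why every failed CAS costs a fresh read of the contended cacheline:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of claiming 'n' slots on a shared queue by CAS on a
 * packed prod/cons word, as in arm_smmu_cmdq_issue_cmdlist(). Assumed
 * layout: low 32 bits prod, high 32 bits cons. */
static uint64_t claim_slots(_Atomic uint64_t *llq, uint32_t n)
{
	uint64_t old = atomic_load(llq);

	for (;;) {
		uint32_t prod = (uint32_t)old;
		uint32_t cons = (uint32_t)(old >> 32);
		uint64_t new = ((uint64_t)cons << 32) | (prod + n);

		/* on failure, 'old' is reloaded from the contended line
		 * and the whole attempt is retried */
		if (atomic_compare_exchange_weak(llq, &old, new))
			return old;	/* we own [prod, prod + n) */
	}
}
```

With many CPUs racing, each failure means another round trip for the cacheline, which matches the exponential failure rate observed as more CPUs get involved.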
John
* Re: arm-smmu-v3 high cpu usage for NVMe
2020-03-20 10:41 ` John Garry
2020-03-20 11:18 ` Jean-Philippe Brucker
@ 2020-05-22 14:52 ` John Garry
2020-05-25 5:57 ` Song Bao Hua (Barry Song)
1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2020-05-22 14:52 UTC (permalink / raw)
To: Will Deacon, Robin Murphy
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, alexandru.elisei
On 20/03/2020 10:41, John Garry wrote:
+ Barry, Alexandru
>>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost:
>>>>> 0/34434 drop:
>>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> 27.43% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
>>>>> 11.71% [kernel] [k] _raw_spin_unlock_irqrestore
>>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
>>>>> 2.65% [kernel] [k] get_user_pages_fast
>>>>> 2.03% [kernel] [k] __slab_free
>>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
>>>>> 1.47% [kernel] [k] arm_lpae_map
>>>>> 1.39% [kernel] [k] __fget
>>>>> 1.14% [kernel] [k] __lock_text_start
>>>>> 1.09% [kernel] [k] _raw_spin_lock
>>>>> 1.08% [kernel] [k] bio_release_pages.part.42
>>>>> 1.03% [kernel] [k] __sbitmap_get_word
>>>>> 0.97% [kernel] [k]
>>>>> arm_smmu_atc_inv_domain.constprop.42
>>>>> 0.91% [kernel] [k] fput_many
>>>>> 0.88% [kernel] [k] __arm_lpae_map
>>>>>
Hi Will, Robin,
I'm just getting around to look at this topic again. Here's the current
picture for my NVMe test:
perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear
* only certain CPUs run the dma unmap for my scenario, cpu0 being one of
them.
Colleague Barry has similar findings for some other scenarios.
So we tried the latest perf NMI support wip patches, and noticed a few
hotspots (see
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
and
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
when running some NVMe traffic:
- initial cmpxchg to get a place in the queue
- when more CPUs get involved, we start failing at an exponential rate
 0.00 : ffff8000107a3500: cas x4, x2, [x27]
26.52 : ffff8000107a3504: mov x0, x4
      : arm_smmu_cmdq_issue_cmdlist():
- the queue locking
- polling cmd_sync
Some ideas to optimise:
a. initial cmpxchg
So this cmpxchg could be considered unfair. In addition, with all the
contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
around the system.
Maybe we can implement something similar to the idea of queued/ticketed
spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
initial cmpxchg fails, released by its leader, and in turn releasing
subsequent followers.
b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop checking for queue
full or similar. This should also help with problem a., as the less time
between cmpxchg attempts, the less chance of failing (since we check for
available queue space between attempts).
So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
always issue a cmd_sync for a batch (regardless of whether requested),
then we should never fill (I think).
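That inequality can be written as a trivial helper (a hypothetical sketch; the per-CPU batch-plus-CMD_SYNC accounting is my assumption, not something the driver exposes):

```c
#include <stdbool.h>

/* Hypothetical helper for idea b.: if every CPU can have at most one
 * batch of commands plus its trailing CMD_SYNC in flight, then a queue
 * deeper than nr_cpus * (max_batch + 1) entries can never become full,
 * so the queue-full polling could be elided. */
static bool cmdq_can_never_fill(unsigned int cmdq_depth,
				unsigned int nr_cpus,
				unsigned int max_batch)
{
	return cmdq_depth > nr_cpus * (max_batch + 1);
}
```

For example, a 32768-entry queue with 96 CPUs and batches of up to 31 commands satisfies the condition with a lot of headroom.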
c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't
think that this is required.
d. More minor ideas are to move forward when the "owner" stops gathering
to reduce the time spent advancing the prod, hopefully reducing cmd_sync
polling time; and also to use a smaller word size for the valid bitmap
operations - maybe 32b atomic operations are overall more efficient than
64b, as the valid range check is mostly < 16 bits from my observation.
Let me know your thoughts or any other ideas.
Thanks,
John
* RE: arm-smmu-v3 high cpu usage for NVMe
2020-05-22 14:52 ` John Garry
@ 2020-05-25 5:57 ` Song Bao Hua (Barry Song)
0 siblings, 0 replies; 32+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-05-25 5:57 UTC (permalink / raw)
To: John Garry, Will Deacon, Robin Murphy
Cc: Jean-Philippe Brucker, Marc Zyngier, Linuxarm, Ming Lei, iommu,
alexandru.elisei
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
>
> + Barry, Alexandru
>
> >>>>> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost:
> >>>>> 0/34434 drop:
> >>>>> 0/40116 [4000Hz cycles], (all, 96 CPUs)
> >>>>>
> ----------------------------------------------------------------------------------------------------------------
> ----------
> >>>>>
> >>>>>
> >>>>> 27.43% [kernel] [k]
> arm_smmu_cmdq_issue_cmdlist
> >>>>> 11.71% [kernel] [k]
> _raw_spin_unlock_irqrestore
> >>>>> 6.35% [kernel] [k] _raw_spin_unlock_irq
> >>>>> 2.65% [kernel] [k] get_user_pages_fast
> >>>>> 2.03% [kernel] [k] __slab_free
> >>>>> 1.55% [kernel] [k] tick_nohz_idle_exit
> >>>>> 1.47% [kernel] [k] arm_lpae_map
> >>>>> 1.39% [kernel] [k] __fget
> >>>>> 1.14% [kernel] [k] __lock_text_start
> >>>>> 1.09% [kernel] [k] _raw_spin_lock
> >>>>> 1.08% [kernel] [k] bio_release_pages.part.42
> >>>>> 1.03% [kernel] [k] __sbitmap_get_word
> >>>>> 0.97% [kernel] [k]
> >>>>> arm_smmu_atc_inv_domain.constprop.42
> >>>>> 0.91% [kernel] [k] fput_many
> >>>>> 0.88% [kernel] [k] __arm_lpae_map
> >>>>>
>
> Hi Will, Robin,
>
> I'm just getting around to look at this topic again. Here's the current
> picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
>
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
>
> Colleague Barry has similar findings for some other scenarios.
I wrote a test module which uses the parameter "ways" to simulate how busy
the SMMU is, and compared the latency under different degrees of contention.
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/jiffies.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
With ways=1, the SMMU is quite free with only one user, yet
arm_smmu_cmdq_issue_cmdlist() spends more than 60% of its time in
arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports completion of
CMD_SYNC quite slowly.
When I increased "ways", I found the contention increased rapidly. With
ways=16, more than 40% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
and with ways=64, more than 60% of the time is spent on that same cmpxchg.
Here is a table for dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC
with different ways:

          whole unmap (ns)   arm_smmu_cmdq_issue_cmdlist() (ns)   wait CMD_SYNC (ns)
Ways=1          1956                     1328                            883
Ways=16         8891                     7474                           4000
Ways=32        22043                    19519                           6879
Ways=64        60842                    55895                          16746
Ways=96       101880                    93649                          24429
As you can see, even with ways=1 we still need ~2us to unmap;
arm_smmu_cmdq_issue_cmdlist() takes about 68% of the dma_unmap time, and
CMD_SYNC takes more than 60% of arm_smmu_cmdq_issue_cmdlist().
When the SMMU is very busy, dma_unmap latency can become very large due to
contention - more than 100us.
Thanks
Barry
>
> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3
> 912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3
> 912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
>
> - initial cmpxchg to get a place in the queue
> - when more CPUs get involved, we start failing at an exponential rate
> 0.00 : ffff8000107a3500: cas x4, x2, [x27]
> 26.52 : ffff8000107a3504: mov x0, x4 :
> arm_smmu_cmdq_issue_cmdlist():
>
> - the queue locking
> - polling cmd_sync
>
> Some ideas to optimise:
>
> a. initial cmpxchg
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
> initial cmpxchg fails, to be released by its leader, which in turn
> releases subsequent followers
>
> b. Drop the queue_full checking in certain circumstances
> If we cannot theoretically fill the queue, then stop the checking for
> queue full or similar. This should also help the problem in a., as the
> less time between cmpxchg attempts, the less chance of failing (as we
> check queue available space between cmpxchg attempts).
>
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
> always issue a cmd_sync for a batch (regardless of whether requested),
> then we should never fill (I think).
>
> c. Don't do queue locking in certain circumstances
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
>
> d. More minor ideas are to move forward when the "owner" stops gathering,
> to reduce the time spent advancing the prod pointer and hopefully the
> cmd_sync polling time; and to use a smaller word size for the valid
> bitmap operations, as 32-bit atomic operations may be more efficient
> overall than 64-bit ones (the valid range is mostly < 16 bits from my
> observation).
>
> Let me know your thoughts or any other ideas.
>
> Thanks,
> John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>]
* Re: arm-smmu-v3 high cpu usage for NVMe
[not found] ` <482c00d5-8e6d-1484-820e-1e89851ad5aa@huawei.com>
@ 2020-04-06 15:11 ` John Garry
0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2020-04-06 15:11 UTC (permalink / raw)
To: Will Deacon
Cc: Jean-Philippe Brucker, Marc Zyngier, Ming Lei, iommu, Robin Murphy
On 02/04/2020 13:10, John Garry wrote:
> On 18/03/2020 20:53, Will Deacon wrote:
>>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the
>>> testing our
>>> batch size is 1, so we're not seeing the real benefit of the batching. I
>>> can't help but think that we could improve this code to try to
>>> combine CMD
>>> SYNCs for small batches.
>>>
>>> Anyway, let me know your thoughts or any questions. I'll have a look
>>> if a
>>> get a chance for other possible bottlenecks.
>> Did you ever get any more information on this? I don't have any SMMUv3
>> hardware any more, so I can't really dig into this myself.
>>
>
Hi Will,
JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some
idea of what is going on. Perf annotate did not tell much.
I tested NVMe performance with and without Marc's patchset to spread
LPIs for managed interrupts.
Average duration of arm_smmu_cmdq_issue_cmdlist() mainline [all results
are approximations]:
owner: 6ms
non-owner: 4ms
mainline + LPI spreading patchset:
owner: 25ms
non-owner: 22ms
For this, a list would be an itlb + cmd_sync.
Please note that the LPI spreading patchset is still giving circa 25%
NVMe throughput increase. What happens there is that many more CPUs get
involved, which creates more inter-CPU contention, but the performance
increase comes from alleviating pressure on those overloaded CPUs.
I also notice that with the LPI spreading patchset, on average a CPU is
an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 time in 8, as opposed to 1
in 3 for mainline. This means that we're just creating longer chains of
lists to be published.
But I found that for a non-owner, the average MSI cmd_sync polling time
is 12ms with the LPI spreading patchset. As such, it seems to really be
taking approx (12*2/8-1=) ~3ms to consume a single list. This seems
consistent with my finding that an owner also polls consumption for 3ms.
Without the LPI spreading patchset, polling time is approx 2 and 3ms for
owner and non-owner, respectively.
As an experiment, I tried hacking the code to use a spinlock again for
protecting the command queue, instead of the current solution, and
always saw a performance drop there, as expected. But maybe we can try
not using a spinlock while still serialising production+consumption to
alleviate the long polling periods.
Let me know your thoughts.
Cheers,
John