All of lore.kernel.org
 help / color / mirror / Atom feed
From: Will Deacon <will@kernel.org>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	John Garry <john.garry@huawei.com>,
	Robin Murphy <robin.murphy@arm.com>,
	Joerg Roedel <joro@8bytes.org>,
	iommu <iommu@lists.linux-foundation.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] iommu/arm-smmu-v3: add nr_ats_masters for quickly check
Date: Thu, 15 Aug 2019 16:23:14 +0100	[thread overview]
Message-ID: <20190815152313.apa2d5rzhqa34l7l@willie-the-truck> (raw)
In-Reply-To: <20190815054439.30652-3-thunder.leizhen@huawei.com>

On Thu, Aug 15, 2019 at 01:44:39PM +0800, Zhen Lei wrote:
> When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true, even if a
> smmu domain does not contain any ats master, the operations of
> arm_smmu_atc_inv_to_cmd() and lock protection in arm_smmu_atc_inv_domain()
> are always executed. This will impact performance, especially in
> multi-core and stress scenarios. For my FIO test scenario, about 8%
> performance reduced.
> 
> In fact, we can use a struct member to record how many ats masters that
> the smmu contains. And check that without traverse the list and check all
> masters one by one in the lock protection.
> 
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 29056d9bb12aa01..154334d3310c9b8 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -631,6 +631,7 @@ struct arm_smmu_domain {
>  
>  	struct io_pgtable_ops		*pgtbl_ops;
>  	bool				non_strict;
> +	int				nr_ats_masters;
>  
>  	enum arm_smmu_domain_stage	stage;
>  	union {
> @@ -1531,7 +1532,16 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>  	struct arm_smmu_cmdq_ent cmd;
>  	struct arm_smmu_master *master;
>  
> -	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> +	/*
> +	 * The protectiom of spinlock(&iommu_domain->devices_lock) is omitted.
> +	 * Because for a given master, its map/unmap operations should only be
> +	 * happened after it has been attached and before it has been detached.
> +	 * So that, if at least one master need to be atc invalidated, the
> +	 * value of smmu_domain->nr_ats_masters can not be zero.
> +	 *
> +	 * This can alleviate performance loss in multi-core scenarios.
> +	 */

I find this reasoning pretty dubious, since I think you're assuming that
an endpoint cannot issue speculative ATS translation requests once its
ATS capability is enabled. That said, I think it also means we should enable
ATS in the STE *before* enabling it in the endpoint -- the current logic
looks like it's the wrong way round to me (including in detach()).

Anyway, these speculative translations could race with a concurrent unmap()
call and end up with the ATC containing translations for unmapped pages,
which I think we should try to avoid.

Did the RCU approach not work out? You could use an rwlock instead as a
temporary bodge if the performance doesn't hurt too much.

Alternatively... maybe we could change the attach flow to do something
like:

	enable_ats_in_ste(master);
	enable_ats_at_pcie_endpoint(master);
	spin_lock(devices_lock)
	add_to_device_list(master);
	nr_ats_masters++;
	spin_unlock(devices_lock);
	invalidate_atc(master);

in which case, the concurrent unmapper will be doing something like:

	issue_tlbi();
	smp_mb();
	if (READ_ONCE(nr_ats_masters)) {
		...
	}

and I *think* that means that either the unmapper will see the
nr_ats_masters update and perform the invalidation, or they'll miss
the update but the attach will invalidate the ATC /after/ the TLBI
in the command queue.

Also, John's idea of converting this stuff over to my command batching
mechanism should help a lot if we can defer this to sync time using the
gather structure. Maybe an rwlock would be alright for that. Dunno.

Will

WARNING: multiple messages have this Message-ID (diff)
From: Will Deacon <will@kernel.org>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	iommu <iommu@lists.linux-foundation.org>,
	Robin Murphy <robin.murphy@arm.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v2 2/2] iommu/arm-smmu-v3: add nr_ats_masters for quickly check
Date: Thu, 15 Aug 2019 16:23:14 +0100	[thread overview]
Message-ID: <20190815152313.apa2d5rzhqa34l7l@willie-the-truck> (raw)
In-Reply-To: <20190815054439.30652-3-thunder.leizhen@huawei.com>

On Thu, Aug 15, 2019 at 01:44:39PM +0800, Zhen Lei wrote:
> When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true, even if a
> smmu domain does not contain any ats master, the operations of
> arm_smmu_atc_inv_to_cmd() and lock protection in arm_smmu_atc_inv_domain()
> are always executed. This will impact performance, especially in
> multi-core and stress scenarios. For my FIO test scenario, about 8%
> performance reduced.
> 
> In fact, we can use a struct member to record how many ats masters that
> the smmu contains. And check that without traverse the list and check all
> masters one by one in the lock protection.
> 
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 29056d9bb12aa01..154334d3310c9b8 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -631,6 +631,7 @@ struct arm_smmu_domain {
>  
>  	struct io_pgtable_ops		*pgtbl_ops;
>  	bool				non_strict;
> +	int				nr_ats_masters;
>  
>  	enum arm_smmu_domain_stage	stage;
>  	union {
> @@ -1531,7 +1532,16 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>  	struct arm_smmu_cmdq_ent cmd;
>  	struct arm_smmu_master *master;
>  
> -	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> +	/*
> +	 * The protectiom of spinlock(&iommu_domain->devices_lock) is omitted.
> +	 * Because for a given master, its map/unmap operations should only be
> +	 * happened after it has been attached and before it has been detached.
> +	 * So that, if at least one master need to be atc invalidated, the
> +	 * value of smmu_domain->nr_ats_masters can not be zero.
> +	 *
> +	 * This can alleviate performance loss in multi-core scenarios.
> +	 */

I find this reasoning pretty dubious, since I think you're assuming that
an endpoint cannot issue speculative ATS translation requests once its
ATS capability is enabled. That said, I think it also means we should enable
ATS in the STE *before* enabling it in the endpoint -- the current logic
looks like it's the wrong way round to me (including in detach()).

Anyway, these speculative translations could race with a concurrent unmap()
call and end up with the ATC containing translations for unmapped pages,
which I think we should try to avoid.

Did the RCU approach not work out? You could use an rwlock instead as a
temporary bodge if the performance doesn't hurt too much.

Alternatively... maybe we could change the attach flow to do something
like:

	enable_ats_in_ste(master);
	enable_ats_at_pcie_endpoint(master);
	spin_lock(devices_lock)
	add_to_device_list(master);
	nr_ats_masters++;
	spin_unlock(devices_lock);
	invalidate_atc(master);

in which case, the concurrent unmapper will be doing something like:

	issue_tlbi();
	smp_mb();
	if (READ_ONCE(nr_ats_masters)) {
		...
	}

and I *think* that means that either the unmapper will see the
nr_ats_masters update and perform the invalidation, or they'll miss
the update but the attach will invalidate the ATC /after/ the TLBI
in the command queue.

Also, John's idea of converting this stuff over to my command batching
mechanism should help a lot if we can defer this to sync time using the
gather structure. Maybe an rwlock would be alright for that. Dunno.

Will
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

WARNING: multiple messages have this Message-ID (diff)
From: Will Deacon <will@kernel.org>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	Joerg Roedel <joro@8bytes.org>,
	John Garry <john.garry@huawei.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	iommu <iommu@lists.linux-foundation.org>,
	Robin Murphy <robin.murphy@arm.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v2 2/2] iommu/arm-smmu-v3: add nr_ats_masters for quickly check
Date: Thu, 15 Aug 2019 16:23:14 +0100	[thread overview]
Message-ID: <20190815152313.apa2d5rzhqa34l7l@willie-the-truck> (raw)
In-Reply-To: <20190815054439.30652-3-thunder.leizhen@huawei.com>

On Thu, Aug 15, 2019 at 01:44:39PM +0800, Zhen Lei wrote:
> When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true, even if a
> smmu domain does not contain any ats master, the operations of
> arm_smmu_atc_inv_to_cmd() and lock protection in arm_smmu_atc_inv_domain()
> are always executed. This will impact performance, especially in
> multi-core and stress scenarios. For my FIO test scenario, about 8%
> performance reduced.
> 
> In fact, we can use a struct member to record how many ats masters that
> the smmu contains. And check that without traverse the list and check all
> masters one by one in the lock protection.
> 
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 29056d9bb12aa01..154334d3310c9b8 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -631,6 +631,7 @@ struct arm_smmu_domain {
>  
>  	struct io_pgtable_ops		*pgtbl_ops;
>  	bool				non_strict;
> +	int				nr_ats_masters;
>  
>  	enum arm_smmu_domain_stage	stage;
>  	union {
> @@ -1531,7 +1532,16 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>  	struct arm_smmu_cmdq_ent cmd;
>  	struct arm_smmu_master *master;
>  
> -	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> +	/*
> +	 * The protectiom of spinlock(&iommu_domain->devices_lock) is omitted.
> +	 * Because for a given master, its map/unmap operations should only be
> +	 * happened after it has been attached and before it has been detached.
> +	 * So that, if at least one master need to be atc invalidated, the
> +	 * value of smmu_domain->nr_ats_masters can not be zero.
> +	 *
> +	 * This can alleviate performance loss in multi-core scenarios.
> +	 */

I find this reasoning pretty dubious, since I think you're assuming that
an endpoint cannot issue speculative ATS translation requests once its
ATS capability is enabled. That said, I think it also means we should enable
ATS in the STE *before* enabling it in the endpoint -- the current logic
looks like it's the wrong way round to me (including in detach()).

Anyway, these speculative translations could race with a concurrent unmap()
call and end up with the ATC containing translations for unmapped pages,
which I think we should try to avoid.

Did the RCU approach not work out? You could use an rwlock instead as a
temporary bodge if the performance doesn't hurt too much.

Alternatively... maybe we could change the attach flow to do something
like:

	enable_ats_in_ste(master);
	enable_ats_at_pcie_endpoint(master);
	spin_lock(devices_lock)
	add_to_device_list(master);
	nr_ats_masters++;
	spin_unlock(devices_lock);
	invalidate_atc(master);

in which case, the concurrent unmapper will be doing something like:

	issue_tlbi();
	smp_mb();
	if (READ_ONCE(nr_ats_masters)) {
		...
	}

and I *think* that means that either the unmapper will see the
nr_ats_masters update and perform the invalidation, or they'll miss
the update but the attach will invalidate the ATC /after/ the TLBI
in the command queue.

Also, John's idea of converting this stuff over to my command batching
mechanism should help a lot if we can defer this to sync time using the
gather structure. Maybe an rwlock would be alright for that. Dunno.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2019-08-15 15:23 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-15  5:44 [PATCH v2 0/2] add nr_ats_masters for quickly check Zhen Lei
2019-08-15  5:44 ` Zhen Lei
2019-08-15  5:44 ` Zhen Lei
2019-08-15  5:44 ` [PATCH v2 1/2] iommu/arm-smmu-v3: don't add a master into smmu_domain before it's ready Zhen Lei
2019-08-15  5:44   ` Zhen Lei
2019-08-15  5:44   ` Zhen Lei
2019-08-15  5:44 ` [PATCH v2 2/2] iommu/arm-smmu-v3: add nr_ats_masters for quickly check Zhen Lei
2019-08-15  5:44   ` Zhen Lei
2019-08-15  5:44   ` Zhen Lei
2019-08-15 15:23   ` Will Deacon [this message]
2019-08-15 15:23     ` Will Deacon
2019-08-15 15:23     ` Will Deacon
2019-08-16 10:12     ` Leizhen (ThunderTown)
2019-08-16 10:12       ` Leizhen (ThunderTown)
2019-08-16 10:12       ` Leizhen (ThunderTown)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190815152313.apa2d5rzhqa34l7l@willie-the-truck \
    --to=will@kernel.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jean-philippe.brucker@arm.com \
    --cc=jean-philippe@linaro.org \
    --cc=john.garry@huawei.com \
    --cc=joro@8bytes.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=thunder.leizhen@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.