All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>
To: Robin Murphy <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>, "Joerg Roedel" <joro@8bytes.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	iommu <iommu@lists.linux-foundation.org>,
	<linux-kernel@vger.kernel.org>
Cc: John Garry <john.garry@huawei.com>
Subject: Re: [PATCH] iommu/arm-smmu-v3: Simplify useless instructions in arm_smmu_cmdq_build_cmd()
Date: Wed, 18 Aug 2021 10:28:09 +0800	[thread overview]
Message-ID: <54b94259-628a-1763-0f1e-e2e7c2b2a297@huawei.com> (raw)
In-Reply-To: <6fea9ce0-7b8d-bd46-6b85-f3f9ba3ddd48@arm.com>



On 2021/8/17 21:23, Robin Murphy wrote:
> On 2021-08-17 12:34, Zhen Lei wrote:
>> Although the parameter 'cmd' is always passed by a local array variable,
>> and only this function modifies it, the compiler does not know this. The
>> compiler almost always reads the value of cmd[i] from memory rather than
>> directly using the value cached in the register. This generates many
>> useless instruction operations and affects the performance to some extent.
> 
> Which compiler? GCC 4.9 does not make the same codegen decisions that GCC 10 does; Clang is different again. There are also various config options which affect a compiler's inlining/optimisation choices either directly or indirectly.

gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05)

In addition, yesterday morning I also purposely compiled a compiler with the latest
GCC source code. The result is the same.

gcc version 11.2.0 (GCC)

> 
> If it's something that newer compilers can get right anyway, then micro-optimising just for older ones might warrant a bit more justification.
> 
>> To guide the compiler for proper optimization, 'cmd' is defined as a local
>> array variable, marked as register, and copied to the output parameter at
>> a time when the function is returned.
>>
>> The optimization effect can be viewed by running the "size arm-smmu-v3.o"
>> command.
>>
>> Before:
>>     text    data     bss     dec     hex
>>    27602    1348      56   29006    714e
>>
>> After:
>>     text    data     bss     dec     hex
>>    27402    1348      56   28806    7086
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +++++++++++++++---
>>   1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index d76bbbde558b776..50a9db5bac466c7 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -233,11 +233,19 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>>       return 0;
>>   }
>>   +#define arm_smmu_cmdq_copy_cmd(dst, src)    \
>> +    do {                    \
>> +        dst[0] = src[0];        \
>> +        dst[1] = src[1];        \
>> +    } while (0)
>> +
>>   /* High-level queue accessors */
>> -static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> +static int arm_smmu_cmdq_build_cmd(u64 *out_cmd, struct arm_smmu_cmdq_ent *ent)
>>   {
>> -    memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>> -    cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    register u64 cmd[CMDQ_ENT_DWORDS];
>> +
>> +    cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    cmd[1] = 0;
>>         switch (ent->opcode) {
>>       case CMDQ_OP_TLBI_EL2_ALL:
>> @@ -309,6 +317,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           case PRI_RESP_SUCC:
>>               break;
>>           default:
>> +            arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Why bother writing back a partial command when we're telling the caller it's invalid anyway?

Some callers do not check the return value of arm_smmu_cmdq_build_cmd().
In particular, the arm_smmu_cmdq_batch_add has no judgment. Yes, I should
add judgment there.

> 
>>               return -EINVAL;
>>           }
>>           cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
>> @@ -329,9 +338,12 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
>>           break;
>>       default:
>> +        arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Ditto.
> 
>>           return -ENOENT;
>>       }
>>   +    arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> ...and then it would be simpler to open-code the assignment here.

Right, I'll modify it in v2. I also don't like the addition of arm_smmu_cmdq_copy_cmd().

> 
> I guess if you're really concerned with avoiding temporary commands being written back to the stack and reloaded, it might be worth experimenting with wrapping them in a struct which can be passed around by value - AAPCS64 allows passing a 16-byte composite type purely in registers.

'out_cmd' is an output parameter. Use 16-byte composite type need to modify
many functions, this may not be good..

> 
> Robin.
> 
>> +
>>       return 0;
>>   }
>>  
> .
> 

WARNING: multiple messages have this Message-ID (diff)
From: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>
To: Robin Murphy <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>, "Joerg Roedel" <joro@8bytes.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	iommu <iommu@lists.linux-foundation.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] iommu/arm-smmu-v3: Simplify useless instructions in arm_smmu_cmdq_build_cmd()
Date: Wed, 18 Aug 2021 10:28:09 +0800	[thread overview]
Message-ID: <54b94259-628a-1763-0f1e-e2e7c2b2a297@huawei.com> (raw)
In-Reply-To: <6fea9ce0-7b8d-bd46-6b85-f3f9ba3ddd48@arm.com>



On 2021/8/17 21:23, Robin Murphy wrote:
> On 2021-08-17 12:34, Zhen Lei wrote:
>> Although the parameter 'cmd' is always passed by a local array variable,
>> and only this function modifies it, the compiler does not know this. The
>> compiler almost always reads the value of cmd[i] from memory rather than
>> directly using the value cached in the register. This generates many
>> useless instruction operations and affects the performance to some extent.
> 
> Which compiler? GCC 4.9 does not make the same codegen decisions that GCC 10 does; Clang is different again. There are also various config options which affect a compiler's inlining/optimisation choices either directly or indirectly.

gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05)

In addition, yesterday morning I also purposely compiled a compiler with the latest
GCC source code. The result is the same.

gcc version 11.2.0 (GCC)

> 
> If it's something that newer compilers can get right anyway, then micro-optimising just for older ones might warrant a bit more justification.
> 
>> To guide the compiler for proper optimization, 'cmd' is defined as a local
>> array variable, marked as register, and copied to the output parameter at
>> a time when the function is returned.
>>
>> The optimization effect can be viewed by running the "size arm-smmu-v3.o"
>> command.
>>
>> Before:
>>     text    data     bss     dec     hex
>>    27602    1348      56   29006    714e
>>
>> After:
>>     text    data     bss     dec     hex
>>    27402    1348      56   28806    7086
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +++++++++++++++---
>>   1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index d76bbbde558b776..50a9db5bac466c7 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -233,11 +233,19 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>>       return 0;
>>   }
>>   +#define arm_smmu_cmdq_copy_cmd(dst, src)    \
>> +    do {                    \
>> +        dst[0] = src[0];        \
>> +        dst[1] = src[1];        \
>> +    } while (0)
>> +
>>   /* High-level queue accessors */
>> -static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> +static int arm_smmu_cmdq_build_cmd(u64 *out_cmd, struct arm_smmu_cmdq_ent *ent)
>>   {
>> -    memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>> -    cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    register u64 cmd[CMDQ_ENT_DWORDS];
>> +
>> +    cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    cmd[1] = 0;
>>         switch (ent->opcode) {
>>       case CMDQ_OP_TLBI_EL2_ALL:
>> @@ -309,6 +317,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           case PRI_RESP_SUCC:
>>               break;
>>           default:
>> +            arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Why bother writing back a partial command when we're telling the caller it's invalid anyway?

Some callers do not check the return value of arm_smmu_cmdq_build_cmd().
In particular, the arm_smmu_cmdq_batch_add has no judgment. Yes, I should
add judgment there.

> 
>>               return -EINVAL;
>>           }
>>           cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
>> @@ -329,9 +338,12 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
>>           break;
>>       default:
>> +        arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Ditto.
> 
>>           return -ENOENT;
>>       }
>>   +    arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> ...and then it would be simpler to open-code the assignment here.

Right, I'll modify it in v2. I also don't like the addition of arm_smmu_cmdq_copy_cmd().

> 
> I guess if you're really concerned with avoiding temporary commands being written back to the stack and reloaded, it might be worth experimenting with wrapping them in a struct which can be passed around by value - AAPCS64 allows passing a 16-byte composite type purely in registers.

'out_cmd' is an output parameter. Use 16-byte composite type need to modify
many functions, this may not be good..

> 
> Robin.
> 
>> +
>>       return 0;
>>   }
>>  
> .
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

WARNING: multiple messages have this Message-ID (diff)
From: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>
To: Robin Murphy <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>, "Joerg Roedel" <joro@8bytes.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	iommu <iommu@lists.linux-foundation.org>,
	<linux-kernel@vger.kernel.org>
Cc: John Garry <john.garry@huawei.com>
Subject: Re: [PATCH] iommu/arm-smmu-v3: Simplify useless instructions in arm_smmu_cmdq_build_cmd()
Date: Wed, 18 Aug 2021 10:28:09 +0800	[thread overview]
Message-ID: <54b94259-628a-1763-0f1e-e2e7c2b2a297@huawei.com> (raw)
In-Reply-To: <6fea9ce0-7b8d-bd46-6b85-f3f9ba3ddd48@arm.com>



On 2021/8/17 21:23, Robin Murphy wrote:
> On 2021-08-17 12:34, Zhen Lei wrote:
>> Although the parameter 'cmd' is always passed by a local array variable,
>> and only this function modifies it, the compiler does not know this. The
>> compiler almost always reads the value of cmd[i] from memory rather than
>> directly using the value cached in the register. This generates many
>> useless instruction operations and affects the performance to some extent.
> 
> Which compiler? GCC 4.9 does not make the same codegen decisions that GCC 10 does; Clang is different again. There are also various config options which affect a compiler's inlining/optimisation choices either directly or indirectly.

gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05)

In addition, yesterday morning I also purposely compiled a compiler with the latest
GCC source code. The result is the same.

gcc version 11.2.0 (GCC)

> 
> If it's something that newer compilers can get right anyway, then micro-optimising just for older ones might warrant a bit more justification.
> 
>> To guide the compiler for proper optimization, 'cmd' is defined as a local
>> array variable, marked as register, and copied to the output parameter at
>> a time when the function is returned.
>>
>> The optimization effect can be viewed by running the "size arm-smmu-v3.o"
>> command.
>>
>> Before:
>>     text    data     bss     dec     hex
>>    27602    1348      56   29006    714e
>>
>> After:
>>     text    data     bss     dec     hex
>>    27402    1348      56   28806    7086
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +++++++++++++++---
>>   1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index d76bbbde558b776..50a9db5bac466c7 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -233,11 +233,19 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>>       return 0;
>>   }
>>   +#define arm_smmu_cmdq_copy_cmd(dst, src)    \
>> +    do {                    \
>> +        dst[0] = src[0];        \
>> +        dst[1] = src[1];        \
>> +    } while (0)
>> +
>>   /* High-level queue accessors */
>> -static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> +static int arm_smmu_cmdq_build_cmd(u64 *out_cmd, struct arm_smmu_cmdq_ent *ent)
>>   {
>> -    memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>> -    cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    register u64 cmd[CMDQ_ENT_DWORDS];
>> +
>> +    cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> +    cmd[1] = 0;
>>         switch (ent->opcode) {
>>       case CMDQ_OP_TLBI_EL2_ALL:
>> @@ -309,6 +317,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           case PRI_RESP_SUCC:
>>               break;
>>           default:
>> +            arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Why bother writing back a partial command when we're telling the caller it's invalid anyway?

Some callers do not check the return value of arm_smmu_cmdq_build_cmd().
In particular, the arm_smmu_cmdq_batch_add has no judgment. Yes, I should
add judgment there.

> 
>>               return -EINVAL;
>>           }
>>           cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
>> @@ -329,9 +338,12 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>           cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
>>           break;
>>       default:
>> +        arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> Ditto.
> 
>>           return -ENOENT;
>>       }
>>   +    arm_smmu_cmdq_copy_cmd(out_cmd, cmd);
> 
> ...and then it would be simpler to open-code the assignment here.

Right, I'll modify it in v2. I also don't like the addition of arm_smmu_cmdq_copy_cmd().

> 
> I guess if you're really concerned with avoiding temporary commands being written back to the stack and reloaded, it might be worth experimenting with wrapping them in a struct which can be passed around by value - AAPCS64 allows passing a 16-byte composite type purely in registers.

'out_cmd' is an output parameter. Use 16-byte composite type need to modify
many functions, this may not be good..

> 
> Robin.
> 
>> +
>>       return 0;
>>   }
>>  
> .
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2021-08-18  2:28 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-17 11:34 [PATCH] iommu/arm-smmu-v3: Simplify useless instructions in arm_smmu_cmdq_build_cmd() Zhen Lei
2021-08-17 11:34 ` Zhen Lei
2021-08-17 11:34 ` Zhen Lei
2021-08-17 13:23 ` Robin Murphy
2021-08-17 13:23   ` Robin Murphy
2021-08-17 13:23   ` Robin Murphy
2021-08-18  2:28   ` Leizhen (ThunderTown) [this message]
2021-08-18  2:28     ` Leizhen (ThunderTown)
2021-08-18  2:28     ` Leizhen (ThunderTown)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54b94259-628a-1763-0f1e-e2e7c2b2a297@huawei.com \
    --to=thunder.leizhen@huawei.com \
    --cc=iommu@lists.linux-foundation.org \
    --cc=john.garry@huawei.com \
    --cc=joro@8bytes.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.