All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mark Rutland <mark.rutland@arm.com>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls
Date: Tue, 14 Sep 2021 10:55:16 +0100	[thread overview]
Message-ID: <20210914095436.GA26544@C02TD0UTHF1T.local> (raw)
In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com>

Hi,

On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote:
> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
> 
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
> 
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
> 
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
>      ff0:       d503245f        bti     c
>      ff4:       d503233f        paciasp
>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>      ffc:       910003fd        mov     x29, sp
>     1000:       d5385201        mrs     x1, esr_el1
>     1004:       531a7c22        lsr     w2, w1, #26
>     1008:       f100545f        cmp     x2, #0x15
>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>     1010:       97fffe14        bl      860 <el0_svc>
>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>     1018:       d50323bf        autiasp
>     101c:       d65f03c0        ret
>     1020:       f100705f        cmp     x2, #0x1c

It would be helpful if you could state which toolchain and config was
used to generate the above.

For comparison, what was the code generation like before? I assume
el0_svc wasn't the target of the first test and branch? Assuming so, how
many tests and branches were there before the call to el0_svc()?

At a high-level, I'm not too keen on special-casing things unless
necessary.

I wonder if we could get similar results without special-casing by using
a static const array of handlers indexed by the EC, since (with GCC
11.1.0 from the kernel.org crosstool page) that can result in code like:

0000000000001010 <el0t_64_sync_handler>:
    1010:       d503245f        bti     c
    1014:       d503233f        paciasp
    1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
    101c:       910003fd        mov     x29, sp
    1020:       d5385201        mrs     x1, esr_el1
    1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
    1028:       531a7c23        lsr     w3, w1, #26
    102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
    1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
    1034:       d63f0040        blr     x2
    1038:       a8c17bfd        ldp     x29, x30, [sp], #16
    103c:       d50323bf        autiasp
    1040:       d65f03c0        ret

... which might do better by virtue of reducing a chain of potential
mispredicts down to a single potential mispredict, and dynamic branch
prediction hopefully does a good job of predicting the common case at
runtime. That said, the resulting tables will be pretty big...

> 
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
> 
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
> 
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds

I appreciate this can be seen by a microbenchmark, but does this have an
impact on a real workload? I'd imagine that real syscall usage will
dominate this in practice, and this would fall into the noise.

Thanks,
Mark.

> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC64:
> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>  		el0_svc(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC32:
> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>  		el0_svc_compat(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> -- 
> 2.25.1
> 

WARNING: multiple messages have this Message-ID (diff)
From: Mark Rutland <mark.rutland@arm.com>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls
Date: Tue, 14 Sep 2021 10:55:16 +0100	[thread overview]
Message-ID: <20210914095436.GA26544@C02TD0UTHF1T.local> (raw)
In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com>

Hi,

On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote:
> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
> 
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
> 
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
> 
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
>      ff0:       d503245f        bti     c
>      ff4:       d503233f        paciasp
>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>      ffc:       910003fd        mov     x29, sp
>     1000:       d5385201        mrs     x1, esr_el1
>     1004:       531a7c22        lsr     w2, w1, #26
>     1008:       f100545f        cmp     x2, #0x15
>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>     1010:       97fffe14        bl      860 <el0_svc>
>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>     1018:       d50323bf        autiasp
>     101c:       d65f03c0        ret
>     1020:       f100705f        cmp     x2, #0x1c

It would be helpful if you could state which toolchain and config was
used to generate the above.

For comparison, what was the code generation like before? I assume
el0_svc wasn't the target of the first test and branch? Assuming so, how
many tests and branches were there before the call to el0_svc()?

At a high-level, I'm not too keen on special-casing things unless
necessary.

I wonder if we could get similar results without special-casing by using
a static const array of handlers indexed by the EC, since (with GCC
11.1.0 from the kernel.org crosstool page) that can result in code like:

0000000000001010 <el0t_64_sync_handler>:
    1010:       d503245f        bti     c
    1014:       d503233f        paciasp
    1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
    101c:       910003fd        mov     x29, sp
    1020:       d5385201        mrs     x1, esr_el1
    1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
    1028:       531a7c23        lsr     w3, w1, #26
    102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
    1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
    1034:       d63f0040        blr     x2
    1038:       a8c17bfd        ldp     x29, x30, [sp], #16
    103c:       d50323bf        autiasp
    1040:       d65f03c0        ret

... which might do better by virtue of reducing a chain of potential
mispredicts down to a single potential mispredict, and dynamic branch
prediction hopefully does a good job of predicting the common case at
runtime. That said, the resulting tables will be pretty big...

> 
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
> 
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
> 
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds

I appreciate this can be seen by a microbenchmark, but does this have an
impact on a real workload? I'd imagine that real syscall usage will
dominate this in practice, and this would fall into the noise.

Thanks,
Mark.

> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC64:
> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>  		el0_svc(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC32:
> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>  		el0_svc_compat(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> -- 
> 2.25.1
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2021-09-14  9:55 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei
2021-09-03 12:19 ` Zhen Lei
2021-09-13 23:01 ` Punit Agrawal
2021-09-13 23:01   ` Punit Agrawal
2021-09-14  2:01   ` Leizhen (ThunderTown)
2021-09-14  2:01     ` Leizhen (ThunderTown)
2021-09-14  9:55 ` Mark Rutland [this message]
2021-09-14  9:55   ` Mark Rutland
2021-09-14 11:23   ` Leizhen (ThunderTown)
2021-09-14 11:23     ` Leizhen (ThunderTown)
2021-09-14 11:48     ` Leizhen (ThunderTown)
2021-09-14 11:48       ` Leizhen (ThunderTown)
2021-09-14 15:17   ` Joey Gouly
2021-09-14 15:17     ` Joey Gouly

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210914095436.GA26544@C02TD0UTHF1T.local \
    --to=mark.rutland@arm.com \
    --cc=catalin.marinas@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=thunder.leizhen@huawei.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.