From: Mark Rutland <mark.rutland@arm.com> To: Zhen Lei <thunder.leizhen@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, linux-arm-kernel <linux-arm-kernel@lists.infradead.org>, linux-kernel@vger.kernel.org Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls Date: Tue, 14 Sep 2021 10:55:16 +0100 [thread overview] Message-ID: <20210914095436.GA26544@C02TD0UTHF1T.local> (raw) In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com> Hi, On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote: > Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots > of functions from assembly to C, this greatly improves readability. But > el0_svc()/el0_svc_compat() is in response to system call requests from > user mode and may be in the hot path. > > Although the SVC is in the first case of the switch statement in C, the > compiler optimizes the switch statement as a whole, and does not give SVC > a small boost. > > Use "likely()" to help SVC directly invoke its handler after a simple > judgment to avoid entering the switch table lookup process. > > After: > 0000000000000ff0 <el0t_64_sync_handler>: > ff0: d503245f bti c > ff4: d503233f paciasp > ff8: a9bf7bfd stp x29, x30, [sp, #-16]! > ffc: 910003fd mov x29, sp > 1000: d5385201 mrs x1, esr_el1 > 1004: 531a7c22 lsr w2, w1, #26 > 1008: f100545f cmp x2, #0x15 > 100c: 540000a1 b.ne 1020 <el0t_64_sync_handler+0x30> > 1010: 97fffe14 bl 860 <el0_svc> > 1014: a8c17bfd ldp x29, x30, [sp], #16 > 1018: d50323bf autiasp > 101c: d65f03c0 ret > 1020: f100705f cmp x2, #0x1c It would be helpful if you could state which toolchain and config was used to generate the above. For comparison, what was the code generation like before? I assume el0_svc wasn't the target of the first test and branch? Assuming so, how many tests and branches were there before the call to el0_svc()? At a high-level, I'm not too keen on special-casing things unless necessary. I wonder if we could get similar results without special-casing by using a static const array of handlers indexed by the EC, since (with GCC 11.1.0 from the kernel.org crosstool page) that can result in code like: 0000000000001010 <el0t_64_sync_handler>: 1010: d503245f bti c 1014: d503233f paciasp 1018: a9bf7bfd stp x29, x30, [sp, #-16]! 101c: 910003fd mov x29, sp 1020: d5385201 mrs x1, esr_el1 1024: 90000002 adrp x2, 0 <el0t_64_sync_handlers> 1028: 531a7c23 lsr w3, w1, #26 102c: 91000042 add x2, x2, #:lo12:<el0t_64_sync_handlers> 1030: f8637842 ldr x2, [x2, x3, lsl #3] 1034: d63f0040 blr x2 1038: a8c17bfd ldp x29, x30, [sp], #16 103c: d50323bf autiasp 1040: d65f03c0 ret ... which might do better by virtue of reducing a chain of potential mispredicts down to a single potential mispredict, and dynamic branch prediction hopefully does a good job of predicting the common case at runtime. That said, the resulting tables will be pretty big... > > Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save > about 10ns. > > Before: > Simple syscall: 0.2365 microseconds > Simple syscall: 0.2354 microseconds > Simple syscall: 0.2339 microseconds > > After: > Simple syscall: 0.2255 microseconds > Simple syscall: 0.2254 microseconds > Simple syscall: 0.2256 microseconds I appreciate this can be seen by a microbenchmark, but does this have an impact on a real workload? I'd imagine that real syscall usage will dominate this in practice, and this would fall into the noise. Thanks, Mark. > Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> > --- > arch/arm64/kernel/entry-common.c | 18 ++++++++++++------ > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c > index 32f9796c4ffe77b..062eb5a895ec6f3 100644 > --- a/arch/arm64/kernel/entry-common.c > +++ b/arch/arm64/kernel/entry-common.c > @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr) > asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs) > { > unsigned long esr = read_sysreg(esr_el1); > + unsigned long ec = ESR_ELx_EC(esr); > > - switch (ESR_ELx_EC(esr)) { > - case ESR_ELx_EC_SVC64: > + if (likely(ec == ESR_ELx_EC_SVC64)) { > el0_svc(regs); > - break; > + return; > + } > + > + switch (ec) { > case ESR_ELx_EC_DABT_LOW: > el0_da(regs, esr); > break; > @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs) > asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs) > { > unsigned long esr = read_sysreg(esr_el1); > + unsigned long ec = ESR_ELx_EC(esr); > > - switch (ESR_ELx_EC(esr)) { > - case ESR_ELx_EC_SVC32: > + if (likely(ec == ESR_ELx_EC_SVC32)) { > el0_svc_compat(regs); > - break; > + return; > + } > + > + switch (ec) { > case ESR_ELx_EC_DABT_LOW: > el0_da(regs, esr); > break; > -- > 2.25.1 >
WARNING: multiple messages have this Message-ID (diff)
From: Mark Rutland <mark.rutland@arm.com> To: Zhen Lei <thunder.leizhen@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, linux-arm-kernel <linux-arm-kernel@lists.infradead.org>, linux-kernel@vger.kernel.org Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls Date: Tue, 14 Sep 2021 10:55:16 +0100 [thread overview] Message-ID: <20210914095436.GA26544@C02TD0UTHF1T.local> (raw) In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com> Hi, On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote: > Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots > of functions from assembly to C, this greatly improves readability. But > el0_svc()/el0_svc_compat() is in response to system call requests from > user mode and may be in the hot path. > > Although the SVC is in the first case of the switch statement in C, the > compiler optimizes the switch statement as a whole, and does not give SVC > a small boost. > > Use "likely()" to help SVC directly invoke its handler after a simple > judgment to avoid entering the switch table lookup process. > > After: > 0000000000000ff0 <el0t_64_sync_handler>: > ff0: d503245f bti c > ff4: d503233f paciasp > ff8: a9bf7bfd stp x29, x30, [sp, #-16]! > ffc: 910003fd mov x29, sp > 1000: d5385201 mrs x1, esr_el1 > 1004: 531a7c22 lsr w2, w1, #26 > 1008: f100545f cmp x2, #0x15 > 100c: 540000a1 b.ne 1020 <el0t_64_sync_handler+0x30> > 1010: 97fffe14 bl 860 <el0_svc> > 1014: a8c17bfd ldp x29, x30, [sp], #16 > 1018: d50323bf autiasp > 101c: d65f03c0 ret > 1020: f100705f cmp x2, #0x1c It would be helpful if you could state which toolchain and config was used to generate the above. For comparison, what was the code generation like before? I assume el0_svc wasn't the target of the first test and branch? Assuming so, how many tests and branches were there before the call to el0_svc()? At a high-level, I'm not too keen on special-casing things unless necessary. I wonder if we could get similar results without special-casing by using a static const array of handlers indexed by the EC, since (with GCC 11.1.0 from the kernel.org crosstool page) that can result in code like: 0000000000001010 <el0t_64_sync_handler>: 1010: d503245f bti c 1014: d503233f paciasp 1018: a9bf7bfd stp x29, x30, [sp, #-16]! 101c: 910003fd mov x29, sp 1020: d5385201 mrs x1, esr_el1 1024: 90000002 adrp x2, 0 <el0t_64_sync_handlers> 1028: 531a7c23 lsr w3, w1, #26 102c: 91000042 add x2, x2, #:lo12:<el0t_64_sync_handlers> 1030: f8637842 ldr x2, [x2, x3, lsl #3] 1034: d63f0040 blr x2 1038: a8c17bfd ldp x29, x30, [sp], #16 103c: d50323bf autiasp 1040: d65f03c0 ret ... which might do better by virtue of reducing a chain of potential mispredicts down to a single potential mispredict, and dynamic branch prediction hopefully does a good job of predicting the common case at runtime. That said, the resulting tables will be pretty big... > > Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save > about 10ns. > > Before: > Simple syscall: 0.2365 microseconds > Simple syscall: 0.2354 microseconds > Simple syscall: 0.2339 microseconds > > After: > Simple syscall: 0.2255 microseconds > Simple syscall: 0.2254 microseconds > Simple syscall: 0.2256 microseconds I appreciate this can be seen by a microbenchmark, but does this have an impact on a real workload? I'd imagine that real syscall usage will dominate this in practice, and this would fall into the noise. Thanks, Mark. > Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> > --- > arch/arm64/kernel/entry-common.c | 18 ++++++++++++------ > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c > index 32f9796c4ffe77b..062eb5a895ec6f3 100644 > --- a/arch/arm64/kernel/entry-common.c > +++ b/arch/arm64/kernel/entry-common.c > @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr) > asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs) > { > unsigned long esr = read_sysreg(esr_el1); > + unsigned long ec = ESR_ELx_EC(esr); > > - switch (ESR_ELx_EC(esr)) { > - case ESR_ELx_EC_SVC64: > + if (likely(ec == ESR_ELx_EC_SVC64)) { > el0_svc(regs); > - break; > + return; > + } > + > + switch (ec) { > case ESR_ELx_EC_DABT_LOW: > el0_da(regs, esr); > break; > @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs) > asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs) > { > unsigned long esr = read_sysreg(esr_el1); > + unsigned long ec = ESR_ELx_EC(esr); > > - switch (ESR_ELx_EC(esr)) { > - case ESR_ELx_EC_SVC32: > + if (likely(ec == ESR_ELx_EC_SVC32)) { > el0_svc_compat(regs); > - break; > + return; > + } > + > + switch (ec) { > case ESR_ELx_EC_DABT_LOW: > el0_da(regs, esr); > break; > -- > 2.25.1 > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2021-09-14 9:55 UTC|newest] Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei 2021-09-03 12:19 ` Zhen Lei 2021-09-13 23:01 ` Punit Agrawal 2021-09-13 23:01 ` Punit Agrawal 2021-09-14 2:01 ` Leizhen (ThunderTown) 2021-09-14 2:01 ` Leizhen (ThunderTown) 2021-09-14 9:55 ` Mark Rutland [this message] 2021-09-14 9:55 ` Mark Rutland 2021-09-14 11:23 ` Leizhen (ThunderTown) 2021-09-14 11:23 ` Leizhen (ThunderTown) 2021-09-14 11:48 ` Leizhen (ThunderTown) 2021-09-14 11:48 ` Leizhen (ThunderTown) 2021-09-14 15:17 ` Joey Gouly 2021-09-14 15:17 ` Joey Gouly
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210914095436.GA26544@C02TD0UTHF1T.local \ --to=mark.rutland@arm.com \ --cc=catalin.marinas@arm.com \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-kernel@vger.kernel.org \ --cc=thunder.leizhen@huawei.com \ --cc=will@kernel.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.