Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

From: Tiezhu Yang <yangtiezhu@loongson.cn>
To: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>,
	Xuefeng Li <lixuefeng@loongson.cn>,
	Jianmin Lv <lvjianmin@loongson.cn>, Jun Yi <yijun@loongson.cn>,
	Rui Wang <wangrui@loongson.cn>,
	LKML <linux-kernel@vger.kernel.org>,
	Jiaxun Yang <jiaxun.yang@flygoat.com>,
	loongarch@lists.linux.dev, Arnd Bergmann <arnd@arndb.de>,
	Guo Ren <guoren@kernel.org>
Subject: Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
Date: Sat, 25 Jun 2022 10:09:59 +0800	[thread overview]
Message-ID: <d782be07-1cfa-626f-d9f5-d151bd091214@loongson.cn> (raw)
In-Reply-To: <1a431c16-ecef-f7e9-4c4f-936e4bb3aeea@loongson.cn>

Cc loongarch@lists.linux.dev
Arnd Bergmann <arnd@arndb.de>
Guo Ren <guoren@kernel.org>

On 06/23/2022 08:43 AM, Tiezhu Yang wrote:
> Cc Jiaxun Yang <jiaxun.yang@flygoat.com>
>
> On 06/22/2022 06:01 PM, Huacai Chen wrote:
>> Hi, Tiezhu,
>>
>> On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <yangtiezhu@loongson.cn>
>> wrote:
>>>
>>> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
>>> for all syscalls.
>>>
>>> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
>>> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>>
>>> This patch only adds the minimal changes as simple as possible
>>> to reduce the code complexity, at the same time, it can reduce
>>> many load instructions.
>>>
>>> Here are the test environments:
>>>
>>>   Hardware: Loongson-LS3A5000-7A1000-1w-A2101
>>>   Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
>>>   System: loongarch64-clfs-system-5.0 [2]
>>>
>>> The system passed functional testing used with the following
>>> test case without and with this patch:
>>>
>>>   git clone https://github.com/hevz/sigaction-test.git
>>>   cd sigaction-test
>>>   make check
>>>
>>> Additionally, use UnixBench syscall to test the performance:
>>>
>>>   git clone https://github.com/kdlucas/byte-unixbench.git
>>>   cd byte-unixbench/UnixBench/
>>>   make
>>>   pgms/syscall 600
>>>
>>> In order to avoid the performance impact, add init=/bin/bash
>>> to the boot cmdline.
>>>
>>> Here is the test result, the bigger the better, it shows about
>>> 1.2% gain tested with close, getpid and exec [3]:
>>>
>>>   duration  without_this_patch  with_this_patch
>>>   600 s     626558267 lps       634244079 lps
>>>
>>> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
>>> [2]
>>> https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
>>> [3]
>>> https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
>>>
>> I test your patch and the whole UnixBench result is like this:
>>
>> Before patch, single thread:
>>
>> System Benchmarks Index Values               BASELINE       RESULT
>> INDEX
>> Dhrystone 2 using register variables         116700.0    9235787.7
>> 791.4
>> Double-Precision Whetstone                       55.0       2758.7
>> 501.6
>> Execl Throughput                                 43.0       2386.8
>> 555.1
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     191752.0
>> 484.2
>> File Copy 256 bufsize 500 maxblocks            1655.0      78737.9
>> 475.8
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     297402.5
>> 512.8
>> Pipe Throughput                               12440.0     353658.1
>> 284.3
>> Pipe-based Context Switching                   4000.0     120140.8
>> 300.4
>> Process Creation                                126.0       5735.0
>> 455.2
>> Shell Scripts (1 concurrent)                     42.4       2701.5
>> 637.1
>> Shell Scripts (8 concurrent)                      6.0        894.9
>> 1491.5
>> System Call Overhead                          15000.0     557467.4
>> 371.6
>>
>> ========
>> System Benchmarks Index Score
>> 516.1
>>
>> After patch, single thread:
>>
>> System Benchmarks Index Values               BASELINE       RESULT
>> INDEX
>> Dhrystone 2 using register variables         116700.0    9235688.9
>> 791.4
>> Double-Precision Whetstone                       55.0       2758.7
>> 501.6
>> Execl Throughput                                 43.0       2377.8
>> 553.0
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     192545.5
>> 486.2
>> File Copy 256 bufsize 500 maxblocks            1655.0      79735.0
>> 481.8
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     299621.9
>> 516.6
>> Pipe Throughput                               12440.0     354969.1
>> 285.3
>> Pipe-based Context Switching                   4000.0     118307.5
>> 295.8
>> Process Creation                                126.0       5757.0
>> 456.9
>> Shell Scripts (1 concurrent)                     42.4       2695.4
>> 635.7
>> Shell Scripts (8 concurrent)                      6.0        894.4
>> 1490.6
>> System Call Overhead                          15000.0     563582.7
>> 375.7
>>
>> ========
>> System Benchmarks Index Score
>> 517.0
>>
>> Before patch, multi-threads:
>>
>> System Benchmarks Index Values               BASELINE       RESULT
>> INDEX
>> Dhrystone 2 using register variables         116700.0   36943633.4
>> 3165.7
>> Double-Precision Whetstone                       55.0      11035.8
>> 2006.5
>> Execl Throughput                                 43.0       8800.1
>> 2046.5
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     277638.3
>> 701.1
>> File Copy 256 bufsize 500 maxblocks            1655.0      92530.5
>> 559.1
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     524344.3
>> 904.0
>> Pipe Throughput                               12440.0    1359237.2
>> 1092.6
>> Pipe-based Context Switching                   4000.0     571511.4
>> 1428.8
>> Process Creation                                126.0      20823.3
>> 1652.6
>> Shell Scripts (1 concurrent)                     42.4       6883.9
>> 1623.6
>> Shell Scripts (8 concurrent)                      6.0        981.7
>> 1636.1
>> System Call Overhead                          15000.0    2029539.8
>> 1353.0
>>
>> ========
>> System Benchmarks Index Score
>> 1367.4
>>
>> After patch, multi-threads:
>>
>> System Benchmarks Index Values               BASELINE       RESULT
>> INDEX
>> Dhrystone 2 using register variables         116700.0   36943793.6
>> 3165.7
>> Double-Precision Whetstone                       55.0      11035.5
>> 2006.4
>> Execl Throughput                                 43.0       8768.3
>> 2039.1
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     277962.9
>> 701.9
>> File Copy 256 bufsize 500 maxblocks            1655.0      92059.7
>> 556.3
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     525937.5
>> 906.8
>> Pipe Throughput                               12440.0    1361566.6
>> 1094.5
>> Pipe-based Context Switching                   4000.0     575835.4
>> 1439.6
>> Process Creation                                126.0      20426.4
>> 1621.1
>> Shell Scripts (1 concurrent)                     42.4       6877.5
>> 1622.0
>> Shell Scripts (8 concurrent)                      6.0        980.3
>> 1633.8
>> System Call Overhead                          15000.0    2049771.6
>> 1366.5
>>
>> ========
>> System Benchmarks Index Score
>> 1366.6
>>
>> From my point of view, the benefit is negligible.
>
> There is another way to look at what is going on.
> This patch is related with syscall, I prefer to
> observe "System Call Overhead" in the test results.
>
> Here are the INDEX of "System Call Overhead" in your test results:
>
> thread   before_patch    after_patch    gain
> single   371.6           375.7          1.103%
> multi    1353.0          1366.5         0.998%
>
> For now, I would like to wait for other people's review.
> If the conclusion is the optimization is meaningless,
> I am fine with ignoring this patch.

Any comments will be much appreciated.

Here is the link:

https://lore.kernel.org/lkml/1655806074-17454-3-git-send-email-yangtiezhu@loongson.cn/

Thanks,
Tiezhu

>
> Thanks,
> Tiezhu
>
>>
>>
>> Huacai
>>
>>>
>>> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
>>> ---
>>>  arch/loongarch/include/asm/stackframe.h |  5 +++++
>>>  arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
>>>  2 files changed, 20 insertions(+)
>>>
>>> diff --git a/arch/loongarch/include/asm/stackframe.h
>>> b/arch/loongarch/include/asm/stackframe.h
>>> index 4ca9530..551ab8f 100644
>>> --- a/arch/loongarch/include/asm/stackframe.h
>>> +++ b/arch/loongarch/include/asm/stackframe.h
>>> @@ -216,4 +216,9 @@
>>>         RESTORE_SP_AND_RET \docfi
>>>         .endm
>>>
>>> +       .macro  RESTORE_STATIC_SOME_SP_AND_RET docfi=0
>>> +       RESTORE_STATIC \docfi
>>> +       RESTORE_SOME \docfi
>>> +       RESTORE_SP_AND_RET \docfi
>>> +       .endm
>>>  #endif /* _ASM_STACKFRAME_H */
>>> diff --git a/arch/loongarch/kernel/entry.S
>>> b/arch/loongarch/kernel/entry.S
>>> index d5b3dbc..c764c99 100644
>>> --- a/arch/loongarch/kernel/entry.S
>>> +++ b/arch/loongarch/kernel/entry.S
>>> @@ -14,6 +14,7 @@
>>>  #include <asm/regdef.h>
>>>  #include <asm/stackframe.h>
>>>  #include <asm/thread_info.h>
>>> +#include <asm/unistd.h>
>>>
>>>         .text
>>>         .cfi_sections   .debug_frame
>>> @@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
>>>         li.d    tp, ~_THREAD_MASK
>>>         and     tp, tp, sp
>>>
>>> +       /* Syscall number held in a7, we can store it in TI_SYSCALL. */
>>> +        LONG_S  a7, tp, TI_SYSCALL
>>> +
>>>         move    a0, sp
>>>         bl      do_syscall
>>>
>>> +       /*
>>> +        * Syscall number held in a7 which is stored in TI_SYSCALL.
>>> +        * rt_sigreturn call RESTORE_ALL_AND_RET.
>>> +        * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>> +        */
>>> +       LONG_L  t3, tp, TI_SYSCALL
>>> +       li.w    t4, __NR_rt_sigreturn
>>> +       beq     t3, t4, 1f
>>> +
>>> +       RESTORE_STATIC_SOME_SP_AND_RET
>>> +1:
>>>         RESTORE_ALL_AND_RET
>>>  SYM_FUNC_END(handle_syscall)
>>>
>>> --
>>> 2.1.0
>>>