Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

From: Tiezhu Yang <yangtiezhu@loongson.cn>
To: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>,
	Xuefeng Li <lixuefeng@loongson.cn>,
	Jianmin Lv <lvjianmin@loongson.cn>, Jun Yi <yijun@loongson.cn>,
	Rui Wang <wangrui@loongson.cn>,
	LKML <linux-kernel@vger.kernel.org>,
	Jiaxun Yang <jiaxun.yang@flygoat.com>
Subject: Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
Date: Thu, 23 Jun 2022 08:43:35 +0800	[thread overview]
Message-ID: <1a431c16-ecef-f7e9-4c4f-936e4bb3aeea@loongson.cn> (raw)
In-Reply-To: <CAAhV-H6YYhMAyOG+-PDKiLBv1fDoFmkQUSBa4zt=Lf4=HZGxHQ@mail.gmail.com>

Cc Jiaxun Yang <jiaxun.yang@flygoat.com>

On 06/22/2022 06:01 PM, Huacai Chen wrote:
> Hi, Tiezhu,
>
> On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <yangtiezhu@loongson.cn> wrote:
>>
>> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
>> for all syscalls.
>>
>> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
>> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>
>> This patch only adds the minimal changes as simple as possible
>> to reduce the code complexity, at the same time, it can reduce
>> many load instructions.
>>
>> Here are the test environments:
>>
>>   Hardware: Loongson-LS3A5000-7A1000-1w-A2101
>>   Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
>>   System: loongarch64-clfs-system-5.0 [2]
>>
>> The system passed functional testing used with the following
>> test case without and with this patch:
>>
>>   git clone https://github.com/hevz/sigaction-test.git
>>   cd sigaction-test
>>   make check
>>
>> Additionally, use UnixBench syscall to test the performance:
>>
>>   git clone https://github.com/kdlucas/byte-unixbench.git
>>   cd byte-unixbench/UnixBench/
>>   make
>>   pgms/syscall 600
>>
>> In order to avoid the performance impact, add init=/bin/bash
>> to the boot cmdline.
>>
>> Here is the test result, the bigger the better, it shows about
>> 1.2% gain tested with close, getpid and exec [3]:
>>
>>   duration  without_this_patch  with_this_patch
>>   600 s     626558267 lps       634244079 lps
>>
>> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
>> [2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
>> [3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
> I test your patch and the whole UnixBench result is like this:
>
> Before patch, single thread:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0    9235787.7    791.4
> Double-Precision Whetstone                       55.0       2758.7    501.6
> Execl Throughput                                 43.0       2386.8    555.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     191752.0    484.2
> File Copy 256 bufsize 500 maxblocks            1655.0      78737.9    475.8
> File Copy 4096 bufsize 8000 maxblocks          5800.0     297402.5    512.8
> Pipe Throughput                               12440.0     353658.1    284.3
> Pipe-based Context Switching                   4000.0     120140.8    300.4
> Process Creation                                126.0       5735.0    455.2
> Shell Scripts (1 concurrent)                     42.4       2701.5    637.1
> Shell Scripts (8 concurrent)                      6.0        894.9   1491.5
> System Call Overhead                          15000.0     557467.4    371.6
>                                                                    ========
> System Benchmarks Index Score                                         516.1
>
> After patch, single thread:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0    9235688.9    791.4
> Double-Precision Whetstone                       55.0       2758.7    501.6
> Execl Throughput                                 43.0       2377.8    553.0
> File Copy 1024 bufsize 2000 maxblocks          3960.0     192545.5    486.2
> File Copy 256 bufsize 500 maxblocks            1655.0      79735.0    481.8
> File Copy 4096 bufsize 8000 maxblocks          5800.0     299621.9    516.6
> Pipe Throughput                               12440.0     354969.1    285.3
> Pipe-based Context Switching                   4000.0     118307.5    295.8
> Process Creation                                126.0       5757.0    456.9
> Shell Scripts (1 concurrent)                     42.4       2695.4    635.7
> Shell Scripts (8 concurrent)                      6.0        894.4   1490.6
> System Call Overhead                          15000.0     563582.7    375.7
>                                                                    ========
> System Benchmarks Index Score                                         517.0
>
> Before patch, multi-threads:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0   36943633.4   3165.7
> Double-Precision Whetstone                       55.0      11035.8   2006.5
> Execl Throughput                                 43.0       8800.1   2046.5
> File Copy 1024 bufsize 2000 maxblocks          3960.0     277638.3    701.1
> File Copy 256 bufsize 500 maxblocks            1655.0      92530.5    559.1
> File Copy 4096 bufsize 8000 maxblocks          5800.0     524344.3    904.0
> Pipe Throughput                               12440.0    1359237.2   1092.6
> Pipe-based Context Switching                   4000.0     571511.4   1428.8
> Process Creation                                126.0      20823.3   1652.6
> Shell Scripts (1 concurrent)                     42.4       6883.9   1623.6
> Shell Scripts (8 concurrent)                      6.0        981.7   1636.1
> System Call Overhead                          15000.0    2029539.8   1353.0
>                                                                    ========
> System Benchmarks Index Score                                        1367.4
>
> After patch, multi-threads:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0   36943793.6   3165.7
> Double-Precision Whetstone                       55.0      11035.5   2006.4
> Execl Throughput                                 43.0       8768.3   2039.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     277962.9    701.9
> File Copy 256 bufsize 500 maxblocks            1655.0      92059.7    556.3
> File Copy 4096 bufsize 8000 maxblocks          5800.0     525937.5    906.8
> Pipe Throughput                               12440.0    1361566.6   1094.5
> Pipe-based Context Switching                   4000.0     575835.4   1439.6
> Process Creation                                126.0      20426.4   1621.1
> Shell Scripts (1 concurrent)                     42.4       6877.5   1622.0
> Shell Scripts (8 concurrent)                      6.0        980.3   1633.8
> System Call Overhead                          15000.0    2049771.6   1366.5
>                                                                    ========
> System Benchmarks Index Score                                        1366.6
>
> From my point of view, the benefit is negligible.

There is another way to look at what is going on.
This patch is related with syscall, I prefer to
observe "System Call Overhead" in the test results.

Here are the INDEX of "System Call Overhead" in your test results:

thread   before_patch    after_patch    gain
single   371.6           375.7          1.103%
multi    1353.0          1366.5         0.998%

For now, I would like to wait for other people's review.
If the conclusion is the optimization is meaningless,
I am fine with ignoring this patch.

Thanks,
Tiezhu

>
>
> Huacai
>
>>
>> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
>> ---
>>  arch/loongarch/include/asm/stackframe.h |  5 +++++
>>  arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
>> index 4ca9530..551ab8f 100644
>> --- a/arch/loongarch/include/asm/stackframe.h
>> +++ b/arch/loongarch/include/asm/stackframe.h
>> @@ -216,4 +216,9 @@
>>         RESTORE_SP_AND_RET \docfi
>>         .endm
>>
>> +       .macro  RESTORE_STATIC_SOME_SP_AND_RET docfi=0
>> +       RESTORE_STATIC \docfi
>> +       RESTORE_SOME \docfi
>> +       RESTORE_SP_AND_RET \docfi
>> +       .endm
>>  #endif /* _ASM_STACKFRAME_H */
>> diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
>> index d5b3dbc..c764c99 100644
>> --- a/arch/loongarch/kernel/entry.S
>> +++ b/arch/loongarch/kernel/entry.S
>> @@ -14,6 +14,7 @@
>>  #include <asm/regdef.h>
>>  #include <asm/stackframe.h>
>>  #include <asm/thread_info.h>
>> +#include <asm/unistd.h>
>>
>>         .text
>>         .cfi_sections   .debug_frame
>> @@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
>>         li.d    tp, ~_THREAD_MASK
>>         and     tp, tp, sp
>>
>> +       /* Syscall number held in a7, we can store it in TI_SYSCALL. */
>> +        LONG_S  a7, tp, TI_SYSCALL
>> +
>>         move    a0, sp
>>         bl      do_syscall
>>
>> +       /*
>> +        * Syscall number held in a7 which is stored in TI_SYSCALL.
>> +        * rt_sigreturn call RESTORE_ALL_AND_RET.
>> +        * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>> +        */
>> +       LONG_L  t3, tp, TI_SYSCALL
>> +       li.w    t4, __NR_rt_sigreturn
>> +       beq     t3, t4, 1f
>> +
>> +       RESTORE_STATIC_SOME_SP_AND_RET
>> +1:
>>         RESTORE_ALL_AND_RET
>>  SYM_FUNC_END(handle_syscall)
>>
>> --
>> 2.1.0
>>