* [PATCH v2 0/2] LoongArch: Modify handle_syscall
From: Tiezhu Yang @ 2022-06-21 10:07 UTC
  To: Huacai Chen, WANG Xuerui
  Cc: Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, linux-kernel

v2: update the commit message of patch #2 to fix a typo,
    sorry for that.

Tiezhu Yang (2):
  LoongArch: Add TI_SYSCALL in output_thread_info_defines()
  LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

 arch/loongarch/include/asm/stackframe.h |  5 +++++
 arch/loongarch/kernel/asm-offsets.c     |  1 +
 arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
 3 files changed, 21 insertions(+)

-- 
2.1.0



* [PATCH v2 1/2] LoongArch: Add TI_SYSCALL in output_thread_info_defines()
From: Tiezhu Yang @ 2022-06-21 10:07 UTC
  To: Huacai Chen, WANG Xuerui
  Cc: Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, linux-kernel

The initial idea was to store the syscall number in PT_R11 and
read it back from there before RESTORE in handle_syscall. However,
PT_R11 may be overwritten by the signal handler, in which case the
syscall number would be lost.

Add TI_SYSCALL in output_thread_info_defines() so that the syscall
number can be stored in TI_SYSCALL instead. This is preparation for
a later patch.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 arch/loongarch/kernel/asm-offsets.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/loongarch/kernel/asm-offsets.c b/arch/loongarch/kernel/asm-offsets.c
index bfb65eb..4757ebe 100644
--- a/arch/loongarch/kernel/asm-offsets.c
+++ b/arch/loongarch/kernel/asm-offsets.c
@@ -81,6 +81,7 @@ void output_thread_info_defines(void)
 	OFFSET(TI_CPU, thread_info, cpu);
 	OFFSET(TI_PRE_COUNT, thread_info, preempt_count);
 	OFFSET(TI_REGS, thread_info, regs);
+	OFFSET(TI_SYSCALL, thread_info, syscall);
 	DEFINE(_THREAD_SIZE, THREAD_SIZE);
 	DEFINE(_THREAD_MASK, THREAD_MASK);
 	DEFINE(_IRQ_STACK_SIZE, IRQ_STACK_SIZE);
-- 
2.1.0



* [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
From: Tiezhu Yang @ 2022-06-21 10:07 UTC
  To: Huacai Chen, WANG Xuerui
  Cc: Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, linux-kernel

In handle_syscall, it is unnecessary to use RESTORE_ALL_AND_RET
for all syscalls:

(1) rt_sigreturn uses RESTORE_ALL_AND_RET.
(2) All other syscalls use RESTORE_STATIC_SOME_SP_AND_RET.

This patch makes only minimal changes to keep the code simple,
while saving many load instructions on the return path of every
syscall other than rt_sigreturn.

Here are the test environments:

  Hardware: Loongson-LS3A5000-7A1000-1w-A2101
  Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
  System: loongarch64-clfs-system-5.0 [2]

The system passes functional testing with the following test
case, both without and with this patch:

  git clone https://github.com/hevz/sigaction-test.git
  cd sigaction-test
  make check

Additionally, use UnixBench syscall to test the performance:

  git clone https://github.com/kdlucas/byte-unixbench.git
  cd byte-unixbench/UnixBench/
  make
  pgms/syscall 600

To avoid perturbing the measurement, boot with init=/bin/bash
on the kernel command line.

Here is the test result (higher is better); it shows a gain of
about 1.2%, tested with close, getpid and exec [3]:

  duration  without_this_patch  with_this_patch
  600 s     626558267 lps       634244079 lps

[1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
[2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
[3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 arch/loongarch/include/asm/stackframe.h |  5 +++++
 arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
index 4ca9530..551ab8f 100644
--- a/arch/loongarch/include/asm/stackframe.h
+++ b/arch/loongarch/include/asm/stackframe.h
@@ -216,4 +216,9 @@
 	RESTORE_SP_AND_RET \docfi
 	.endm
 
+	.macro	RESTORE_STATIC_SOME_SP_AND_RET docfi=0
+	RESTORE_STATIC \docfi
+	RESTORE_SOME \docfi
+	RESTORE_SP_AND_RET \docfi
+	.endm
 #endif /* _ASM_STACKFRAME_H */
diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
index d5b3dbc..c764c99 100644
--- a/arch/loongarch/kernel/entry.S
+++ b/arch/loongarch/kernel/entry.S
@@ -14,6 +14,7 @@
 #include <asm/regdef.h>
 #include <asm/stackframe.h>
 #include <asm/thread_info.h>
+#include <asm/unistd.h>
 
 	.text
 	.cfi_sections	.debug_frame
@@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
 	li.d	tp, ~_THREAD_MASK
 	and	tp, tp, sp
 
+	/* The syscall number is held in a7; save it in TI_SYSCALL. */
+	LONG_S	a7, tp, TI_SYSCALL
+
 	move	a0, sp
 	bl	do_syscall
 
+	/*
+	 * The syscall number from a7 was saved in TI_SYSCALL above.
+	 * rt_sigreturn uses RESTORE_ALL_AND_RET.
+	 * All other syscalls use RESTORE_STATIC_SOME_SP_AND_RET.
+	 */
+	LONG_L	t3, tp, TI_SYSCALL
+	li.w	t4, __NR_rt_sigreturn
+	beq	t3, t4, 1f
+
+	RESTORE_STATIC_SOME_SP_AND_RET
+1:
 	RESTORE_ALL_AND_RET
 SYM_FUNC_END(handle_syscall)
 
-- 
2.1.0



* Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
From: Huacai Chen @ 2022-06-22 10:01 UTC
  To: Tiezhu Yang; +Cc: WANG Xuerui, Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, LKML

Hi, Tiezhu,

On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <yangtiezhu@loongson.cn> wrote:
>
> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
> for all syscalls.
>
> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>
> This patch only adds the minimal changes as simple as possible
> to reduce the code complexity, at the same time, it can reduce
> many load instructions.
>
> Here are the test environments:
>
>   Hardware: Loongson-LS3A5000-7A1000-1w-A2101
>   Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
>   System: loongarch64-clfs-system-5.0 [2]
>
> The system passed functional testing used with the following
> test case without and with this patch:
>
>   git clone https://github.com/hevz/sigaction-test.git
>   cd sigaction-test
>   make check
>
> Additionally, use UnixBench syscall to test the performance:
>
>   git clone https://github.com/kdlucas/byte-unixbench.git
>   cd byte-unixbench/UnixBench/
>   make
>   pgms/syscall 600
>
> In order to avoid the performance impact, add init=/bin/bash
> to the boot cmdline.
>
> Here is the test result, the bigger the better, it shows about
> 1.2% gain tested with close, getpid and exec [3]:
>
>   duration  without_this_patch  with_this_patch
>   600 s     626558267 lps       634244079 lps
>
> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
> [2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
> [3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
I tested your patch; the full UnixBench results are as follows:

Before patch, single thread:

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    9235787.7    791.4
Double-Precision Whetstone                       55.0       2758.7    501.6
Execl Throughput                                 43.0       2386.8    555.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     191752.0    484.2
File Copy 256 bufsize 500 maxblocks            1655.0      78737.9    475.8
File Copy 4096 bufsize 8000 maxblocks          5800.0     297402.5    512.8
Pipe Throughput                               12440.0     353658.1    284.3
Pipe-based Context Switching                   4000.0     120140.8    300.4
Process Creation                                126.0       5735.0    455.2
Shell Scripts (1 concurrent)                     42.4       2701.5    637.1
Shell Scripts (8 concurrent)                      6.0        894.9   1491.5
System Call Overhead                          15000.0     557467.4    371.6
                                                                   ========
System Benchmarks Index Score                                         516.1

After patch, single thread:

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    9235688.9    791.4
Double-Precision Whetstone                       55.0       2758.7    501.6
Execl Throughput                                 43.0       2377.8    553.0
File Copy 1024 bufsize 2000 maxblocks          3960.0     192545.5    486.2
File Copy 256 bufsize 500 maxblocks            1655.0      79735.0    481.8
File Copy 4096 bufsize 8000 maxblocks          5800.0     299621.9    516.6
Pipe Throughput                               12440.0     354969.1    285.3
Pipe-based Context Switching                   4000.0     118307.5    295.8
Process Creation                                126.0       5757.0    456.9
Shell Scripts (1 concurrent)                     42.4       2695.4    635.7
Shell Scripts (8 concurrent)                      6.0        894.4   1490.6
System Call Overhead                          15000.0     563582.7    375.7
                                                                   ========
System Benchmarks Index Score                                         517.0

Before patch, multiple threads:

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   36943633.4   3165.7
Double-Precision Whetstone                       55.0      11035.8   2006.5
Execl Throughput                                 43.0       8800.1   2046.5
File Copy 1024 bufsize 2000 maxblocks          3960.0     277638.3    701.1
File Copy 256 bufsize 500 maxblocks            1655.0      92530.5    559.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     524344.3    904.0
Pipe Throughput                               12440.0    1359237.2   1092.6
Pipe-based Context Switching                   4000.0     571511.4   1428.8
Process Creation                                126.0      20823.3   1652.6
Shell Scripts (1 concurrent)                     42.4       6883.9   1623.6
Shell Scripts (8 concurrent)                      6.0        981.7   1636.1
System Call Overhead                          15000.0    2029539.8   1353.0
                                                                   ========
System Benchmarks Index Score                                        1367.4

After patch, multiple threads:

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   36943793.6   3165.7
Double-Precision Whetstone                       55.0      11035.5   2006.4
Execl Throughput                                 43.0       8768.3   2039.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     277962.9    701.9
File Copy 256 bufsize 500 maxblocks            1655.0      92059.7    556.3
File Copy 4096 bufsize 8000 maxblocks          5800.0     525937.5    906.8
Pipe Throughput                               12440.0    1361566.6   1094.5
Pipe-based Context Switching                   4000.0     575835.4   1439.6
Process Creation                                126.0      20426.4   1621.1
Shell Scripts (1 concurrent)                     42.4       6877.5   1622.0
Shell Scripts (8 concurrent)                      6.0        980.3   1633.8
System Call Overhead                          15000.0    2049771.6   1366.5
                                                                   ========
System Benchmarks Index Score                                        1366.6

From my point of view, the benefit is negligible.


Huacai

>
> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
> ---
>  arch/loongarch/include/asm/stackframe.h |  5 +++++
>  arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
> index 4ca9530..551ab8f 100644
> --- a/arch/loongarch/include/asm/stackframe.h
> +++ b/arch/loongarch/include/asm/stackframe.h
> @@ -216,4 +216,9 @@
>         RESTORE_SP_AND_RET \docfi
>         .endm
>
> +       .macro  RESTORE_STATIC_SOME_SP_AND_RET docfi=0
> +       RESTORE_STATIC \docfi
> +       RESTORE_SOME \docfi
> +       RESTORE_SP_AND_RET \docfi
> +       .endm
>  #endif /* _ASM_STACKFRAME_H */
> diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
> index d5b3dbc..c764c99 100644
> --- a/arch/loongarch/kernel/entry.S
> +++ b/arch/loongarch/kernel/entry.S
> @@ -14,6 +14,7 @@
>  #include <asm/regdef.h>
>  #include <asm/stackframe.h>
>  #include <asm/thread_info.h>
> +#include <asm/unistd.h>
>
>         .text
>         .cfi_sections   .debug_frame
> @@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
>         li.d    tp, ~_THREAD_MASK
>         and     tp, tp, sp
>
> +       /* Syscall number held in a7, we can store it in TI_SYSCALL. */
> +        LONG_S  a7, tp, TI_SYSCALL
> +
>         move    a0, sp
>         bl      do_syscall
>
> +       /*
> +        * Syscall number held in a7 which is stored in TI_SYSCALL.
> +        * rt_sigreturn call RESTORE_ALL_AND_RET.
> +        * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
> +        */
> +       LONG_L  t3, tp, TI_SYSCALL
> +       li.w    t4, __NR_rt_sigreturn
> +       beq     t3, t4, 1f
> +
> +       RESTORE_STATIC_SOME_SP_AND_RET
> +1:
>         RESTORE_ALL_AND_RET
>  SYM_FUNC_END(handle_syscall)
>
> --
> 2.1.0
>


* Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
From: Tiezhu Yang @ 2022-06-23  0:43 UTC
  To: Huacai Chen
  Cc: WANG Xuerui, Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, LKML, Jiaxun Yang

Cc Jiaxun Yang <jiaxun.yang@flygoat.com>

On 06/22/2022 06:01 PM, Huacai Chen wrote:
> Hi, Tiezhu,
>
> On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <yangtiezhu@loongson.cn> wrote:
>>
>> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
>> for all syscalls.
>>
>> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
>> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>
>> This patch only adds the minimal changes as simple as possible
>> to reduce the code complexity, at the same time, it can reduce
>> many load instructions.
>>
>> Here are the test environments:
>>
>>   Hardware: Loongson-LS3A5000-7A1000-1w-A2101
>>   Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
>>   System: loongarch64-clfs-system-5.0 [2]
>>
>> The system passed functional testing used with the following
>> test case without and with this patch:
>>
>>   git clone https://github.com/hevz/sigaction-test.git
>>   cd sigaction-test
>>   make check
>>
>> Additionally, use UnixBench syscall to test the performance:
>>
>>   git clone https://github.com/kdlucas/byte-unixbench.git
>>   cd byte-unixbench/UnixBench/
>>   make
>>   pgms/syscall 600
>>
>> In order to avoid the performance impact, add init=/bin/bash
>> to the boot cmdline.
>>
>> Here is the test result, the bigger the better, it shows about
>> 1.2% gain tested with close, getpid and exec [3]:
>>
>>   duration  without_this_patch  with_this_patch
>>   600 s     626558267 lps       634244079 lps
>>
>> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
>> [2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
>> [3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
> I test your patch and the whole UnixBench result is like this:
>
> Before patch, single thread:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0    9235787.7    791.4
> Double-Precision Whetstone                       55.0       2758.7    501.6
> Execl Throughput                                 43.0       2386.8    555.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     191752.0    484.2
> File Copy 256 bufsize 500 maxblocks            1655.0      78737.9    475.8
> File Copy 4096 bufsize 8000 maxblocks          5800.0     297402.5    512.8
> Pipe Throughput                               12440.0     353658.1    284.3
> Pipe-based Context Switching                   4000.0     120140.8    300.4
> Process Creation                                126.0       5735.0    455.2
> Shell Scripts (1 concurrent)                     42.4       2701.5    637.1
> Shell Scripts (8 concurrent)                      6.0        894.9   1491.5
> System Call Overhead                          15000.0     557467.4    371.6
>                                                                    ========
> System Benchmarks Index Score                                         516.1
>
> After patch, single thread:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0    9235688.9    791.4
> Double-Precision Whetstone                       55.0       2758.7    501.6
> Execl Throughput                                 43.0       2377.8    553.0
> File Copy 1024 bufsize 2000 maxblocks          3960.0     192545.5    486.2
> File Copy 256 bufsize 500 maxblocks            1655.0      79735.0    481.8
> File Copy 4096 bufsize 8000 maxblocks          5800.0     299621.9    516.6
> Pipe Throughput                               12440.0     354969.1    285.3
> Pipe-based Context Switching                   4000.0     118307.5    295.8
> Process Creation                                126.0       5757.0    456.9
> Shell Scripts (1 concurrent)                     42.4       2695.4    635.7
> Shell Scripts (8 concurrent)                      6.0        894.4   1490.6
> System Call Overhead                          15000.0     563582.7    375.7
>                                                                    ========
> System Benchmarks Index Score                                         517.0
>
> Before patch, multi-threads:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0   36943633.4   3165.7
> Double-Precision Whetstone                       55.0      11035.8   2006.5
> Execl Throughput                                 43.0       8800.1   2046.5
> File Copy 1024 bufsize 2000 maxblocks          3960.0     277638.3    701.1
> File Copy 256 bufsize 500 maxblocks            1655.0      92530.5    559.1
> File Copy 4096 bufsize 8000 maxblocks          5800.0     524344.3    904.0
> Pipe Throughput                               12440.0    1359237.2   1092.6
> Pipe-based Context Switching                   4000.0     571511.4   1428.8
> Process Creation                                126.0      20823.3   1652.6
> Shell Scripts (1 concurrent)                     42.4       6883.9   1623.6
> Shell Scripts (8 concurrent)                      6.0        981.7   1636.1
> System Call Overhead                          15000.0    2029539.8   1353.0
>                                                                    ========
> System Benchmarks Index Score                                        1367.4
>
> After patch, multi-threads:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0   36943793.6   3165.7
> Double-Precision Whetstone                       55.0      11035.5   2006.4
> Execl Throughput                                 43.0       8768.3   2039.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     277962.9    701.9
> File Copy 256 bufsize 500 maxblocks            1655.0      92059.7    556.3
> File Copy 4096 bufsize 8000 maxblocks          5800.0     525937.5    906.8
> Pipe Throughput                               12440.0    1361566.6   1094.5
> Pipe-based Context Switching                   4000.0     575835.4   1439.6
> Process Creation                                126.0      20426.4   1621.1
> Shell Scripts (1 concurrent)                     42.4       6877.5   1622.0
> Shell Scripts (8 concurrent)                      6.0        980.3   1633.8
> System Call Overhead                          15000.0    2049771.6   1366.5
>                                                                    ========
> System Benchmarks Index Score                                        1366.6
>
> From my point of view, the benefit is negligible.

There is another way to look at these results.
This patch concerns the syscall path, so I prefer to
focus on "System Call Overhead" in the test results.

Here is the INDEX of "System Call Overhead" from your test results:

thread   before_patch    after_patch    gain
single   371.6           375.7          1.103%
multi    1353.0          1366.5         0.998%

For now, I would like to wait for other people's review.
If the conclusion is that the optimization is meaningless,
I am fine with dropping this patch.

Thanks,
Tiezhu

>
>
> Huacai
>
>>
>> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
>> ---
>>  arch/loongarch/include/asm/stackframe.h |  5 +++++
>>  arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
>> index 4ca9530..551ab8f 100644
>> --- a/arch/loongarch/include/asm/stackframe.h
>> +++ b/arch/loongarch/include/asm/stackframe.h
>> @@ -216,4 +216,9 @@
>>         RESTORE_SP_AND_RET \docfi
>>         .endm
>>
>> +       .macro  RESTORE_STATIC_SOME_SP_AND_RET docfi=0
>> +       RESTORE_STATIC \docfi
>> +       RESTORE_SOME \docfi
>> +       RESTORE_SP_AND_RET \docfi
>> +       .endm
>>  #endif /* _ASM_STACKFRAME_H */
>> diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
>> index d5b3dbc..c764c99 100644
>> --- a/arch/loongarch/kernel/entry.S
>> +++ b/arch/loongarch/kernel/entry.S
>> @@ -14,6 +14,7 @@
>>  #include <asm/regdef.h>
>>  #include <asm/stackframe.h>
>>  #include <asm/thread_info.h>
>> +#include <asm/unistd.h>
>>
>>         .text
>>         .cfi_sections   .debug_frame
>> @@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
>>         li.d    tp, ~_THREAD_MASK
>>         and     tp, tp, sp
>>
>> +       /* Syscall number held in a7, we can store it in TI_SYSCALL. */
>> +        LONG_S  a7, tp, TI_SYSCALL
>> +
>>         move    a0, sp
>>         bl      do_syscall
>>
>> +       /*
>> +        * Syscall number held in a7 which is stored in TI_SYSCALL.
>> +        * rt_sigreturn call RESTORE_ALL_AND_RET.
>> +        * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>> +        */
>> +       LONG_L  t3, tp, TI_SYSCALL
>> +       li.w    t4, __NR_rt_sigreturn
>> +       beq     t3, t4, 1f
>> +
>> +       RESTORE_STATIC_SOME_SP_AND_RET
>> +1:
>>         RESTORE_ALL_AND_RET
>>  SYM_FUNC_END(handle_syscall)
>>
>> --
>> 2.1.0
>>



* Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls
From: Tiezhu Yang @ 2022-06-25  2:09 UTC
  To: Huacai Chen
  Cc: WANG Xuerui, Xuefeng Li, Jianmin Lv, Jun Yi, Rui Wang, LKML,
	Jiaxun Yang, loongarch, Arnd Bergmann, Guo Ren

Cc loongarch@lists.linux.dev
Arnd Bergmann <arnd@arndb.de>
Guo Ren <guoren@kernel.org>

On 06/23/2022 08:43 AM, Tiezhu Yang wrote:
> Cc Jiaxun Yang <jiaxun.yang@flygoat.com>
>
> On 06/22/2022 06:01 PM, Huacai Chen wrote:
>> Hi, Tiezhu,
>>
>> On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <yangtiezhu@loongson.cn> wrote:
>>>
>>> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
>>> for all syscalls.
>>>
>>> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
>>> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>>
>>> This patch only adds the minimal changes as simple as possible
>>> to reduce the code complexity, at the same time, it can reduce
>>> many load instructions.
>>>
>>> Here are the test environments:
>>>
>>>   Hardware: Loongson-LS3A5000-7A1000-1w-A2101
>>>   Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
>>>   System: loongarch64-clfs-system-5.0 [2]
>>>
>>> The system passed functional testing used with the following
>>> test case without and with this patch:
>>>
>>>   git clone https://github.com/hevz/sigaction-test.git
>>>   cd sigaction-test
>>>   make check
>>>
>>> Additionally, use UnixBench syscall to test the performance:
>>>
>>>   git clone https://github.com/kdlucas/byte-unixbench.git
>>>   cd byte-unixbench/UnixBench/
>>>   make
>>>   pgms/syscall 600
>>>
>>> In order to avoid the performance impact, add init=/bin/bash
>>> to the boot cmdline.
>>>
>>> Here is the test result, the bigger the better, it shows about
>>> 1.2% gain tested with close, getpid and exec [3]:
>>>
>>>   duration  without_this_patch  with_this_patch
>>>   600 s     626558267 lps       634244079 lps
>>>
>>> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
>>> [2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
>>> [3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
>>>
>> I test your patch and the whole UnixBench result is like this:
>>
>> Before patch, single thread:
>>
>> System Benchmarks Index Values               BASELINE       RESULT    INDEX
>> Dhrystone 2 using register variables         116700.0    9235787.7    791.4
>> Double-Precision Whetstone                       55.0       2758.7    501.6
>> Execl Throughput                                 43.0       2386.8    555.1
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     191752.0    484.2
>> File Copy 256 bufsize 500 maxblocks            1655.0      78737.9    475.8
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     297402.5    512.8
>> Pipe Throughput                               12440.0     353658.1    284.3
>> Pipe-based Context Switching                   4000.0     120140.8    300.4
>> Process Creation                                126.0       5735.0    455.2
>> Shell Scripts (1 concurrent)                     42.4       2701.5    637.1
>> Shell Scripts (8 concurrent)                      6.0        894.9   1491.5
>> System Call Overhead                          15000.0     557467.4    371.6
>>                                                                    ========
>> System Benchmarks Index Score                                         516.1
>>
>> After patch, single thread:
>>
>> System Benchmarks Index Values               BASELINE       RESULT    INDEX
>> Dhrystone 2 using register variables         116700.0    9235688.9    791.4
>> Double-Precision Whetstone                       55.0       2758.7    501.6
>> Execl Throughput                                 43.0       2377.8    553.0
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     192545.5    486.2
>> File Copy 256 bufsize 500 maxblocks            1655.0      79735.0    481.8
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     299621.9    516.6
>> Pipe Throughput                               12440.0     354969.1    285.3
>> Pipe-based Context Switching                   4000.0     118307.5    295.8
>> Process Creation                                126.0       5757.0    456.9
>> Shell Scripts (1 concurrent)                     42.4       2695.4    635.7
>> Shell Scripts (8 concurrent)                      6.0        894.4   1490.6
>> System Call Overhead                          15000.0     563582.7    375.7
>>                                                                    ========
>> System Benchmarks Index Score                                         517.0
>>
>> Before patch, multi-threads:
>>
>> System Benchmarks Index Values               BASELINE       RESULT    INDEX
>> Dhrystone 2 using register variables         116700.0   36943633.4   3165.7
>> Double-Precision Whetstone                       55.0      11035.8   2006.5
>> Execl Throughput                                 43.0       8800.1   2046.5
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     277638.3    701.1
>> File Copy 256 bufsize 500 maxblocks            1655.0      92530.5    559.1
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     524344.3    904.0
>> Pipe Throughput                               12440.0    1359237.2   1092.6
>> Pipe-based Context Switching                   4000.0     571511.4   1428.8
>> Process Creation                                126.0      20823.3   1652.6
>> Shell Scripts (1 concurrent)                     42.4       6883.9   1623.6
>> Shell Scripts (8 concurrent)                      6.0        981.7   1636.1
>> System Call Overhead                          15000.0    2029539.8   1353.0
>>                                                                    ========
>> System Benchmarks Index Score                                        1367.4
>>
>> After patch, multi-threads:
>>
>> System Benchmarks Index Values               BASELINE       RESULT    INDEX
>> Dhrystone 2 using register variables         116700.0   36943793.6   3165.7
>> Double-Precision Whetstone                       55.0      11035.5   2006.4
>> Execl Throughput                                 43.0       8768.3   2039.1
>> File Copy 1024 bufsize 2000 maxblocks          3960.0     277962.9    701.9
>> File Copy 256 bufsize 500 maxblocks            1655.0      92059.7    556.3
>> File Copy 4096 bufsize 8000 maxblocks          5800.0     525937.5    906.8
>> Pipe Throughput                               12440.0    1361566.6   1094.5
>> Pipe-based Context Switching                   4000.0     575835.4   1439.6
>> Process Creation                                126.0      20426.4   1621.1
>> Shell Scripts (1 concurrent)                     42.4       6877.5   1622.0
>> Shell Scripts (8 concurrent)                      6.0        980.3   1633.8
>> System Call Overhead                          15000.0    2049771.6   1366.5
>>                                                                    ========
>> System Benchmarks Index Score                                        1366.6
>>
>> From my point of view, the benefit is negligible.
>
> There is another way to look at what is going on.
> This patch is related with syscall, I prefer to
> observe "System Call Overhead" in the test results.
>
> Here are the INDEX of "System Call Overhead" in your test results:
>
> thread   before_patch    after_patch    gain
> single   371.6           375.7          1.103%
> multi    1353.0          1366.5         0.998%
>
> For now, I would like to wait for other people's review.
> If the conclusion is the optimization is meaningless,
> I am fine with ignoring this patch.

Any comments will be much appreciated.

Here is the link:

https://lore.kernel.org/lkml/1655806074-17454-3-git-send-email-yangtiezhu@loongson.cn/

Thanks,
Tiezhu

>
> Thanks,
> Tiezhu
>
>>
>>
>> Huacai
>>
>>>
>>> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
>>> ---
>>>  arch/loongarch/include/asm/stackframe.h |  5 +++++
>>>  arch/loongarch/kernel/entry.S           | 15 +++++++++++++++
>>>  2 files changed, 20 insertions(+)
>>>
>>> diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
>>> index 4ca9530..551ab8f 100644
>>> --- a/arch/loongarch/include/asm/stackframe.h
>>> +++ b/arch/loongarch/include/asm/stackframe.h
>>> @@ -216,4 +216,9 @@
>>>         RESTORE_SP_AND_RET \docfi
>>>         .endm
>>>
>>> +       .macro  RESTORE_STATIC_SOME_SP_AND_RET docfi=0
>>> +       RESTORE_STATIC \docfi
>>> +       RESTORE_SOME \docfi
>>> +       RESTORE_SP_AND_RET \docfi
>>> +       .endm
>>>  #endif /* _ASM_STACKFRAME_H */
>>> diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
>>> index d5b3dbc..c764c99 100644
>>> --- a/arch/loongarch/kernel/entry.S
>>> +++ b/arch/loongarch/kernel/entry.S
>>> @@ -14,6 +14,7 @@
>>>  #include <asm/regdef.h>
>>>  #include <asm/stackframe.h>
>>>  #include <asm/thread_info.h>
>>> +#include <asm/unistd.h>
>>>
>>>         .text
>>>         .cfi_sections   .debug_frame
>>> @@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
>>>         li.d    tp, ~_THREAD_MASK
>>>         and     tp, tp, sp
>>>
>>> +       /* Syscall number held in a7, we can store it in TI_SYSCALL. */
>>> +        LONG_S  a7, tp, TI_SYSCALL
>>> +
>>>         move    a0, sp
>>>         bl      do_syscall
>>>
>>> +       /*
>>> +        * Syscall number held in a7 which is stored in TI_SYSCALL.
>>> +        * rt_sigreturn call RESTORE_ALL_AND_RET.
>>> +        * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>>> +        */
>>> +       LONG_L  t3, tp, TI_SYSCALL
>>> +       li.w    t4, __NR_rt_sigreturn
>>> +       beq     t3, t4, 1f
>>> +
>>> +       RESTORE_STATIC_SOME_SP_AND_RET
>>> +1:
>>>         RESTORE_ALL_AND_RET
>>>  SYM_FUNC_END(handle_syscall)
>>>
>>> --
>>> 2.1.0
>>>

