linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
@ 2021-06-19 11:21 Akira Tsukamoto
  2021-06-19 11:34 ` [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code Akira Tsukamoto
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:21 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Akira Tsukamoto,
	linux-kernel, linux-riscv

Optimizing copy_to_user and copy_from_user.

I rewrote the functions in v2, heavily influenced by Gary's memcpy
function [1].
The functions must be written in assembler to handle page faults manually
inside the function.

With the changes, the CPU usage percentage improves, as does some of the
network throughput for UDP packets.
Only copy_user is patched; the original memcpy is kept as-is.

All results are from the same base kernel, same rootfs and same
BeagleV beta board.

Comparison by "perf top -Ue task-clock" while running iperf3.

--- TCP recv ---
  * Before
   40.40%  [kernel]  [k] memcpy
   33.09%  [kernel]  [k] __asm_copy_to_user
  * After
   50.35%  [kernel]  [k] memcpy
   13.76%  [kernel]  [k] __asm_copy_to_user

--- TCP send ---
  * Before
   19.96%  [kernel]  [k] memcpy
    9.84%  [kernel]  [k] __asm_copy_to_user
  * After
   14.27%  [kernel]  [k] memcpy
    7.37%  [kernel]  [k] __asm_copy_to_user

--- UDP send ---
  * Before
   25.18%  [kernel]  [k] memcpy
   22.50%  [kernel]  [k] __asm_copy_to_user
  * After
   28.90%  [kernel]  [k] memcpy
    9.49%  [kernel]  [k] __asm_copy_to_user

--- UDP recv ---
  * Before
   44.45%  [kernel]  [k] memcpy
   31.04%  [kernel]  [k] __asm_copy_to_user
  * After
   55.62%  [kernel]  [k] memcpy
   11.22%  [kernel]  [k] __asm_copy_to_user

Processing network packets requires many unaligned accesses for the packet
headers, and the header format cannot be redesigned to be aligned.
User applications also call send/recv() and sendto/recvfrom() with large
buffers so that fewer system calls are needed.
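
As a rough illustration (not part of this series), the user-space pattern
being exercised looks like the hypothetical sketch below; the buffer size
and helper name are made up for the example:

/*
 * Hypothetical user-space sketch (C): one large buffer per recvfrom()
 * call keeps the syscall count low, so most of the per-byte cost ends
 * up inside __asm_copy_to_user when the kernel copies the payload out.
 */
#include <sys/types.h>
#include <sys/socket.h>

#define RX_BUF_SZ 65536   /* example size, several MTUs worth of data */

static void udp_rx_loop(int sock)
{
	static char buf[RX_BUF_SZ];
	ssize_t n;

	while ((n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL)) > 0) {
		/* process n received bytes ... */
	}
}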

v1 -> v2:
- Added shift copy
- Separated patches for readability of changes in assembler
- Using perf results

[1] https://lkml.org/lkml/2021/2/16/778

Akira Tsukamoto (5):
   riscv: __asm_to/copy_from_user: delete existing code
   riscv: __asm_to/copy_from_user: Adding byte copy first
   riscv: __asm_to/copy_from_user: Copy until dst is aligned address
   riscv: __asm_to/copy_from_user: Bulk copy while shifting misaligned
     data
   riscv: __asm_to/copy_from_user: Bulk copy when both src dst are
     aligned

  arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
  1 file changed, 146 insertions(+), 35 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
@ 2021-06-19 11:34 ` Akira Tsukamoto
  2021-06-21 11:45   ` David Laight
  2021-06-19 11:35 ` [PATCH 2/5] riscv: __asm_to/copy_from_user: Adding byte copy first Akira Tsukamoto
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:34 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel, linux-riscv


This is to make the diffs in the following patches easier to read,
since a diff of assembler code is horrible to read.

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 40 ----------------------------------------
 1 file changed, 40 deletions(-)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index fceaeb18cc64..da9536e1e9cb 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -19,50 +19,10 @@ ENTRY(__asm_copy_from_user)
 	li t6, SR_SUM
 	csrs CSR_STATUS, t6
 
-	add a3, a1, a2
-	/* Use word-oriented copy only if low-order bits match */
-	andi t0, a0, SZREG-1
-	andi t1, a1, SZREG-1
-	bne t0, t1, 2f
-
-	addi t0, a1, SZREG-1
-	andi t1, a3, ~(SZREG-1)
-	andi t0, t0, ~(SZREG-1)
-	/*
-	 * a3: terminal address of source region
-	 * t0: lowest XLEN-aligned address in source
-	 * t1: highest XLEN-aligned address in source
-	 */
-	bgeu t0, t1, 2f
-	bltu a1, t0, 4f
-1:
-	fixup REG_L, t2, (a1), 10f
-	fixup REG_S, t2, (a0), 10f
-	addi a1, a1, SZREG
-	addi a0, a0, SZREG
-	bltu a1, t1, 1b
-2:
-	bltu a1, a3, 5f
-
-3:
 	/* Disable access to user memory */
 	csrc CSR_STATUS, t6
 	li a0, 0
 	ret
-4: /* Edge case: unalignment */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, t0, 4b
-	j 1b
-5: /* Edge case: remainder */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, a3, 5b
-	j 3b
 ENDPROC(__asm_copy_to_user)
 ENDPROC(__asm_copy_from_user)
 EXPORT_SYMBOL(__asm_copy_to_user)
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] riscv: __asm_to/copy_from_user: Adding byte copy first
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
  2021-06-19 11:34 ` [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code Akira Tsukamoto
@ 2021-06-19 11:35 ` Akira Tsukamoto
  2021-06-19 11:36 ` [PATCH 3/5] riscv: __asm_to/copy_from_user: Copy until dst is aligned Akira Tsukamoto
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:35 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel, linux-riscv


A typical load-and-store loop, used mainly for copying the remainder
one byte at a time.
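
In C, the logic added here corresponds roughly to the sketch below
(illustrative only; the real assembler additionally wraps every user
access in the fixup macro so page faults can be handled):

/*
 * Illustrative C sketch of the byte-copy tail; not the kernel code.
 * The assembler version wraps each load/store in a fixup so a fault
 * can report how many bytes were left uncopied.
 */
static void byte_copy_tail(unsigned char *dst, const unsigned char *src,
			   unsigned char *dst_end)
{
	while (dst < dst_end)
		*dst++ = *src++;	/* one byte per iteration */
}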

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index da9536e1e9cb..be1810077f9a 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -19,9 +19,39 @@ ENTRY(__asm_copy_from_user)
 	li t6, SR_SUM
 	csrs CSR_STATUS, t6
 
+	/* Save for return value */
+	mv	t5, a2
+
+	/*
+	 * Register allocation for code below:
+	 * a0 - start of uncopied dst
+	 * a1 - start of uncopied src
+	 * a2 - size
+	 * t0 - end of uncopied dst
+	 */
+	add	t0, a0, a2
+	bgtu	a0, t0, 5f
+
+.Lbyte_copy_tail:
+	/*
+	 * Byte copy anything left.
+	 *
+	 * a0 - start of remaining dst
+	 * a1 - start of remaining src
+	 * t0 - end of remaining dst
+	 */
+	bgeu	a0, t0, 5f
+4:
+	fixup lb      a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb      a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t0, 4b	/* t0 - end of dst */
+
+5:
 	/* Disable access to user memory */
 	csrc CSR_STATUS, t6
-	li a0, 0
+	li	a0, 0
 	ret
 ENDPROC(__asm_copy_to_user)
 ENDPROC(__asm_copy_from_user)
@@ -77,7 +107,7 @@ EXPORT_SYMBOL(__clear_user)
 10:
 	/* Disable access to user memory */
 	csrs CSR_STATUS, t6
-	mv a0, a2
+	mv a0, t5
 	ret
 11:
 	csrs CSR_STATUS, t6
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] riscv: __asm_to/copy_from_user: Copy until dst is aligned
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
  2021-06-19 11:34 ` [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code Akira Tsukamoto
  2021-06-19 11:35 ` [PATCH 2/5] riscv: __asm_to/copy_from_user: Adding byte copy first Akira Tsukamoto
@ 2021-06-19 11:36 ` Akira Tsukamoto
  2021-06-19 11:37 ` [PATCH 4/5] riscv: __asm_to/copy_from_user: Bulk copy while shifting Akira Tsukamoto
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:36 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel, linux-riscv


First copy byte by byte until the destination address reaches the first
word-aligned boundary.

The key to speeding up the copy is avoiding both unaligned memory accesses
and byte accesses. This is the preparation step before the bulk aligned
word copy.
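
Roughly the same idea in C, as an illustrative sketch only (sizeof(long)
stands in for SZREG, and the fixup handling of the real code is omitted):

#include <stdint.h>

/*
 * Illustrative sketch: advance byte by byte until dst sits on a
 * register-size boundary so the later bulk copy can use aligned word
 * accesses. Assumes the remaining size is already known to be large
 * enough (checked against 8*SZREG in the assembler).
 */
static void align_dst(unsigned char **dst, const unsigned char **src)
{
	uintptr_t target = ((uintptr_t)*dst + sizeof(long) - 1) &
			   ~(uintptr_t)(sizeof(long) - 1);

	while ((uintptr_t)*dst < target)
		*(*dst)++ = *(*src)++;
}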

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index be1810077f9a..4906b5ca91c3 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -32,6 +32,34 @@ ENTRY(__asm_copy_from_user)
 	add	t0, a0, a2
 	bgtu	a0, t0, 5f
 
+	/*
+	 * Use byte copy only if too small.
+	 */
+	li	a3, 8*SZREG /* size must be larger than size in word_copy */
+	bltu	a2, a3, .Lbyte_copy_tail
+
+	/*
+	 * Copy first bytes until dst is aligned to a word boundary.
+	 * a0 - start of dst
+	 * t1 - start of aligned dst
+	 */
+	addi	t1, a0, SZREG-1
+	andi	t1, t1, ~(SZREG-1)
+	/* dst is already aligned, skip */
+	beq	a0, t1, .Lskip_first_bytes
+1:
+	/* a5 - one byte for copying data */
+	fixup lb      a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb      a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t1, 1b	/* t1 - start of aligned dst */
+
+.Lskip_first_bytes:
+
+.Lword_copy:
+.Lshift_copy:
+
 .Lbyte_copy_tail:
 	/*
 	 * Byte copy anything left.
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] riscv: __asm_to/copy_from_user: Bulk copy while shifting
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
                   ` (2 preceding siblings ...)
  2021-06-19 11:36 ` [PATCH 3/5] riscv: __asm_to/copy_from_user: Copy until dst is aligned Akira Tsukamoto
@ 2021-06-19 11:37 ` Akira Tsukamoto
  2021-06-19 11:43 ` [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned Akira Tsukamoto
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:37 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel, linux-riscv


The destination address is now aligned, but often the source address
is not.

To reduce unaligned memory accesses, the data is read from the source on
aligned boundaries, which leaves it at an offset; each iteration then
combines the value fetched in the previous iteration with the current one,
shifting to fix the offset before writing to the destination.

The majority of the copy-speed improvement comes from this shift copy.
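
The shift copy corresponds roughly to the C sketch below (illustrative
only; it assumes little-endian byte order and a non-zero source offset,
which matches the assembler since this path is only taken when src is
misaligned, and it omits the fixup handling):

#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch of the shift copy: round src down to an aligned
 * address, do one aligned load per iteration, and build each output
 * word from the previous and current loads. shr/shl mirror t3/t4.
 * off must be non-zero, otherwise shl would equal the register width,
 * which is not a valid C shift amount.
 */
static void shift_copy(unsigned long *dst, const unsigned char *src,
		       size_t words)
{
	size_t off = (uintptr_t)src & (sizeof(long) - 1);	/* like a3 */
	const unsigned long *s = (const unsigned long *)(src - off);
	unsigned int shr = off * 8;				/* like t3 */
	unsigned int shl = sizeof(long) * 8 - shr;		/* like t4 */
	unsigned long prev = *s++;	/* first word, combined with the next */

	while (words--) {
		unsigned long cur = *s++;
		*dst++ = (prev >> shr) | (cur << shl);
		prev = cur;
	}
}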

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 60 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index 4906b5ca91c3..e2e57551fc76 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -56,10 +56,70 @@ ENTRY(__asm_copy_from_user)
 	bltu	a0, t1, 1b	/* t1 - start of aligned dst */
 
 .Lskip_first_bytes:
+	/*
+	 * Now dst is aligned.
+	 * Use shift-copy if src is misaligned.
+	 * Use word-copy if both src and dst are aligned because
+	 * then no shifting is needed.
+	 */
+	/* a1 - start of src */
+	andi	a3, a1, SZREG-1
+	bnez	a3, .Lshift_copy
 
 .Lword_copy:
 .Lshift_copy:
 
+	/*
+	 * Word copy with shifting.
+	 * For misaligned copy we still perform aligned word copy, but
+	 * we need to use the value fetched from the previous iteration and
+	 * do some shifts.
+	 * This is safe because reading less than a word size.
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of uncopied dst
+	 * t1 - end of aligned dst
+	 */
+	/* calculating aligned word boundary for dst */
+	andi	t1, t0, ~(SZREG-1)
+	/* Converting unaligned src to aligned src */
+	andi	a1, a1, ~(SZREG-1)
+
+	/*
+	 * Calculate shifts
+	 * t3 - prev shift
+	 * t4 - current shift
+	 */
+	slli	t3, a3, LGREG
+	li	a5, SZREG*8
+	sub	t4, a5, t3
+
+	/* Load the first word to combine with the second word */
+	fixup REG_L   a5, 0(a1), 10f
+
+3:
+	/* Main shifting copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * t1 - end of aligned dst
+	 */
+
+	/* At least one iteration will be executed */
+	srl	a4, a5, t3
+	fixup REG_L   a5, SZREG(a1), 10f
+	addi	a1, a1, SZREG
+	sll	a2, a5, t4
+	or	a2, a2, a4
+	fixup REG_S   a2, 0(a0), 10f
+	addi	a0, a0, SZREG
+	bltu	a0, t1, 3b
+
+	/* Revert src to original unaligned value  */
+	add	a1, a1, a3
+
 .Lbyte_copy_tail:
 	/*
 	 * Byte copy anything left.
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
                   ` (3 preceding siblings ...)
  2021-06-19 11:37 ` [PATCH 4/5] riscv: __asm_to/copy_from_user: Bulk copy while shifting Akira Tsukamoto
@ 2021-06-19 11:43 ` Akira Tsukamoto
  2021-06-21 11:55   ` David Laight
  2021-06-20 10:02 ` [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Ben Dooks
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-19 11:43 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel, linux-riscv


In the lucky case where both the source and destination addresses are on
an aligned boundary, perform register-sized loads and stores to copy the
data.

Without the unrolling, the copy would be slower, since a store that uses
the register just written by the preceding load would stall the pipeline.
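
The structure corresponds roughly to the C sketch below (illustrative
only, without the fixup handling); grouping the eight loads before the
eight stores keeps each store from depending on the load issued
immediately before it:

#include <stddef.h>

/*
 * Illustrative sketch of the 8-way unrolled aligned copy. Issuing all
 * eight loads before the stores avoids load-to-use stalls on simple
 * in-order pipelines. Whole groups of eight words only; the leftover
 * bytes are handled by the byte-copy tail in the assembler.
 */
static void word_copy_unrolled(unsigned long *dst, const unsigned long *src,
			       size_t n)
{
	size_t i;

	for (i = 0; i + 8 <= n; i += 8) {
		unsigned long w0 = src[i], w1 = src[i + 1];
		unsigned long w2 = src[i + 2], w3 = src[i + 3];
		unsigned long w4 = src[i + 4], w5 = src[i + 5];
		unsigned long w6 = src[i + 6], w7 = src[i + 7];

		dst[i] = w0; dst[i + 1] = w1; dst[i + 2] = w2; dst[i + 3] = w3;
		dst[i + 4] = w4; dst[i + 5] = w5; dst[i + 6] = w6; dst[i + 7] = w7;
	}
}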

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index e2e57551fc76..bceb0629e440 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -67,6 +67,39 @@ ENTRY(__asm_copy_from_user)
 	bnez	a3, .Lshift_copy
 
 .Lword_copy:
+        /*
+	 * Both src and dst are aligned, unrolled word copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of aligned dst
+	 */
+	addi	t0, t0, -(8*SZREG-1) /* not to over run */
+2:
+	fixup REG_L   a4,        0(a1), 10f
+	fixup REG_L   a5,    SZREG(a1), 10f
+	fixup REG_L   a6,  2*SZREG(a1), 10f
+	fixup REG_L   a7,  3*SZREG(a1), 10f
+	fixup REG_L   t1,  4*SZREG(a1), 10f
+	fixup REG_L   t2,  5*SZREG(a1), 10f
+	fixup REG_L   t3,  6*SZREG(a1), 10f
+	fixup REG_L   t4,  7*SZREG(a1), 10f
+	fixup REG_S   a4,        0(a0), 10f
+	fixup REG_S   a5,    SZREG(a0), 10f
+	fixup REG_S   a6,  2*SZREG(a0), 10f
+	fixup REG_S   a7,  3*SZREG(a0), 10f
+	fixup REG_S   t1,  4*SZREG(a0), 10f
+	fixup REG_S   t2,  5*SZREG(a0), 10f
+	fixup REG_S   t3,  6*SZREG(a0), 10f
+	fixup REG_S   t4,  7*SZREG(a0), 10f
+	addi	a0, a0, 8*SZREG
+	addi	a1, a1, 8*SZREG
+	bltu	a0, t0, 2b
+
+	addi	t0, t0, 8*SZREG-1 /* revert to original value */
+	j	.Lbyte_copy_tail
+
 .Lshift_copy:
 
 	/*
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
                   ` (4 preceding siblings ...)
  2021-06-19 11:43 ` [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned Akira Tsukamoto
@ 2021-06-20 10:02 ` Ben Dooks
  2021-06-20 16:36   ` Akira Tsukamoto
  2021-06-22  8:30 ` Ben Dooks
  2021-07-12 21:24 ` Ben Dooks
  7 siblings, 1 reply; 16+ messages in thread
From: Ben Dooks @ 2021-06-20 10:02 UTC (permalink / raw)
  To: Akira Tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv

On 19/06/2021 12:21, Akira Tsukamoto wrote:
> Optimizing copy_to_user and copy_from_user.
> 
> I rewrote the functions in v2, heavily influenced by Gary's memcpy
> function [1].
> The functions must be written in assembler to handle page faults manually
> inside the function.
> 
> With the changes, the CPU usage percentage improves, as does some of the
> network throughput for UDP packets.
> Only copy_user is patched; the original memcpy is kept as-is.
> 
> All results are from the same base kernel, same rootfs and same
> BeagleV beta board.

Is there a git tree for these to try them out?

> Comparison by "perf top -Ue task-clock" while running iperf3.
> 
> --- TCP recv ---
>   * Before
>    40.40%  [kernel]  [k] memcpy
>    33.09%  [kernel]  [k] __asm_copy_to_user
>   * After
>    50.35%  [kernel]  [k] memcpy
>    13.76%  [kernel]  [k] __asm_copy_to_user
> 
> --- TCP send ---
>   * Before
>    19.96%  [kernel]  [k] memcpy
>     9.84%  [kernel]  [k] __asm_copy_to_user
>   * After
>    14.27%  [kernel]  [k] memcpy
>     7.37%  [kernel]  [k] __asm_copy_to_user
> 
> --- UDP send ---
>   * Before
>    25.18%  [kernel]  [k] memcpy
>    22.50%  [kernel]  [k] __asm_copy_to_user
>   * After
>    28.90%  [kernel]  [k] memcpy
>     9.49%  [kernel]  [k] __asm_copy_to_user
> 
> --- UDP recv ---
>   * Before
>    44.45%  [kernel]  [k] memcpy
>    31.04%  [kernel]  [k] __asm_copy_to_user
>   * After
>    55.62%  [kernel]  [k] memcpy
>    11.22%  [kernel]  [k] __asm_copy_to_user

What's the memcpy figure in the above?
Could you explain the figures please?

> Processing network packets requires many unaligned accesses for the packet
> headers, and the header format cannot be redesigned to be aligned.

Isn't there an option to allow padding of network packets
in the skbuff to make the fields aligned for architectures
which do not have efficient unaligned loads (looking at you
arm32)? Has this been looked at?

> User applications also call send/recv() and sendto/recvfrom() with large
> buffers so that fewer system calls are needed.
> 
> v1 -> v2:
> - Added shift copy
> - Separated patches for readability of changes in assembler
> - Using perf results
> 
> [1] https://lkml.org/lkml/2021/2/16/778
> 
> Akira Tsukamoto (5):
>    riscv: __asm_to/copy_from_user: delete existing code
>    riscv: __asm_to/copy_from_user: Adding byte copy first
>    riscv: __asm_to/copy_from_user: Copy until dst is aligned address
>    riscv: __asm_to/copy_from_user: Bulk copy while shifting misaligned
>      data
>    riscv: __asm_to/copy_from_user: Bulk copy when both src dst are
>      aligned
> 
>   arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
>   1 file changed, 146 insertions(+), 35 deletions(-)

I'm concerned that delete and then re-add is either going to make
the series un-bisectable or leave a point where the kernel is very
broken?

-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-20 10:02 ` [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Ben Dooks
@ 2021-06-20 16:36   ` Akira Tsukamoto
  0 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-20 16:36 UTC (permalink / raw)
  To: Ben Dooks, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv


On 6/20/21 19:02, Ben Dooks wrote:
> On 19/06/2021 12:21, Akira Tsukamoto wrote:
>> Optimizing copy_to_user and copy_from_user.
>>
>> I rewrote the functions in v2, heavily influenced by Gary's memcpy
>> function [1].
>> The functions must be written in assembler to handle page faults manually
>> inside the function.
>>
>> With the changes, the CPU usage percentage improves, as does some of the
>> network throughput for UDP packets.
>> Only copy_user is patched; the original memcpy is kept as-is.
>>
>> All results are from the same base kernel, same rootfs and same
>> BeagleV beta board.
> 
> Is there a git tree for these to try them out?

Sure, please try.

The kernel without the patches is on the starlight branch and
the kernel with these patches is on starlight-ua-new-up at
https://github.com/mcd500/linux-jh7100

The starlight branch is maintained by Esmil, where the main development
is happening.

And the rootfs for the beaglev is uploaded below.
https://github.com/mcd500/Fedora_on_StarFive#custome-fedora-image

To reproduce the results:
(please customize with your IP addresses)

The command I used for iperf3.

--- TCP recv ---
** on PC side, using default mtu 1500
$ iperf3 -c 192.168.1.112
** on riscv beaglev side, using default mtu 1500
[root@fedora-starfive ~]# iperf3 -s

--- TCP send ---
** on PC side, using default mtu 1500
$ iperf3 -s
** on riscv beaglev side, using default mtu 1500
[root@fedora-starfive ~]# iperf3 -c 192.168.1.153

--- UDP send ---
** on PC side first, changing mtu size from 1500 to 9000
$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 mtu 9000 up
$ iperf3 -s
** on riscv beaglev side, do not change the mtu size
[root@fedora-starfive ~]# iperf3 -u -b 1000M --length 50000 -c 192.168.1.153

--- UDP recv ---
** on PC side first, changing mtu size to 9000
$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 mtu 9000 up
$ iperf3 -u -b 1000M --length 6500 -c 192.168.1.112
** on riscv beaglev side, changing mtu size to 9000 too
[root@fedora-starfive ~]# sudo ifconfig eth0 down
[root@fedora-starfive ~]# sudo ifconfig eth0 mtu 9000 up
[root@fedora-starfive ~]# iperf3 -s

The perf:
$ sudo perf top -Ue task-clock
after logging in via ssh.

> 
>> Comparison by "perf top -Ue task-clock" while running iperf3.
>>
>> --- TCP recv ---
>>   * Before
>>    40.40%  [kernel]  [k] memcpy
>>    33.09%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    50.35%  [kernel]  [k] memcpy
>>    13.76%  [kernel]  [k] __asm_copy_to_user
>>
>> --- TCP send ---
>>   * Before
>>    19.96%  [kernel]  [k] memcpy
>>     9.84%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    14.27%  [kernel]  [k] memcpy
>>     7.37%  [kernel]  [k] __asm_copy_to_user
>>
>> --- UDP send ---
>>   * Before
>>    25.18%  [kernel]  [k] memcpy
>>    22.50%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    28.90%  [kernel]  [k] memcpy
>>     9.49%  [kernel]  [k] __asm_copy_to_user
>>
>> --- UDP recv ---
>>   * Before
>>    44.45%  [kernel]  [k] memcpy
>>    31.04%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    55.62%  [kernel]  [k] memcpy
>>    11.22%  [kernel]  [k] __asm_copy_to_user
> 
> What's the memcpy figure in the above?
> Could you explain the figures please?

It is the output of "perf top -Ue task-clock"
while running the iperf3 tests I described above.
It shows which functions cause the most CPU overhead
inside the kernel while iperf3 is running.

The two biggest culprits were memcpy and __asm_copy_to_user,
both showing high CPU usage, which is why I listed those two.

This discussion initially started with Gary's memcpy patch
on this list. I will write more details below.

> 
>> Processing network packets requires many unaligned accesses for the packet
>> headers, and the header format cannot be redesigned to be aligned.
> 
> Isn't there an option to allow padding of network packets
> in the skbuff to make the fields aligned for architectures
> which do not have efficient unaligned loads (looking at you
> arm32). Has this been looked at?

I am testing on a 64-bit RISC-V BeagleV beta board.
My understanding of the skbuff padding is that it aligns data
handled inside the kernel. It would help if memcpy and
__asm_copy_to_user were not taking such a huge percentage.

This patch is against copy_to_user and copy_from_user, which are
used purely for copying between kernel space and user space.

Looking at the overhead in perf, the CPU usage of copy_to_user
increases because the user app (iperf3 here) uses the socket API
with a large packet size. I also used to use an MTU-sized buffer
to reduce the number of recvfrom()/sendto() calls in UDP
programming, and most network programmers probably do the same.

> 
>> User applications also call send/recv() and sendto/recvfrom() with large
>> buffers so that fewer system calls are needed.
>>
>> v1 -> v2:
>> - Added shift copy
>> - Separated patches for readability of changes in assembler
>> - Using perf results
>>
>> [1] https://lkml.org/lkml/2021/2/16/778
>>
>> Akira Tsukamoto (5):
>>    riscv: __asm_to/copy_from_user: delete existing code
>>    riscv: __asm_to/copy_from_user: Adding byte copy first
>>    riscv: __asm_to/copy_from_user: Copy until dst is aligned address
>>    riscv: __asm_to/copy_from_user: Bulk copy while shifting misaligned
>>      data
>>    riscv: __asm_to/copy_from_user: Bulk copy when both src dst are
>>      aligned
>>
>>   arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
>>   1 file changed, 146 insertions(+), 35 deletions(-)
> 
> I'm concerned that delete and then re-add is either going to make
> the series un-bisectable or leave a point where the kernel is very
> broken?

I completely agree and understand. The only reason I split the patches
is because of the comments in the other thread. It definitely breaks
bisection. Once the content of this series is understood and agreed on,
I will re-spin it as a single patch, since the kernel will not boot when
only an individual patch is applied.

Gary's memcpy patch was posted a while ago, and even though it had the
best benchmark results, it was not merged.

When we were measuring network performance on the BeagleV beta board,
Gary's memcpy did give a huge improvement.

The patch was not easy to review and understand, but it really helps
network performance.

Reading the discussion on his patch, I felt the first priority is being
able to understand the cause of the CPU usage and speed results.

Please read the discussion between Gary, Palmer, Matteo, others and me
on the list.

Matteo is rewriting Gary's patch in C, which is better for
maintainability and takes advantage of the compiler's optimizations.
The user copy routines are written in assembler or inline assembler,
as I wrote.

I just want to help make this better, so once consensus is reached,
I will squash the series into one patch.
Or I am fine if somebody else comes up with better results.

My attempts at similar patches date back to 2002.
https://linux-kernel.vger.kernel.narkive.com/zU6OFlI6/cft-patch-2-5-47-athlon-druon-much-faster-copy-user-function
http://lkml.iu.edu/hypermail/linux/kernel/0211.2/0928.html

Akira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code
  2021-06-19 11:34 ` [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code Akira Tsukamoto
@ 2021-06-21 11:45   ` David Laight
  2021-06-21 13:55     ` Akira Tsukamoto
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2021-06-21 11:45 UTC (permalink / raw)
  To: 'Akira Tsukamoto',
	Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel,
	linux-riscv

From: Akira Tsukamoto
> Sent: 19 June 2021 12:35
> 
> This is to make the diff easier to read, since the diff on
> assembler is horrible to read.

You can't do that, it breaks bisection.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned
  2021-06-19 11:43 ` [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned Akira Tsukamoto
@ 2021-06-21 11:55   ` David Laight
  2021-06-21 14:13     ` Akira Tsukamoto
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2021-06-21 11:55 UTC (permalink / raw)
  To: 'Akira Tsukamoto',
	Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-kernel,
	linux-riscv

From: Akira Tsukamoto
> Sent: 19 June 2021 12:43
> 
> In the lucky situation that the both source and destination address are on
> the aligned boundary, perform load and store with register size to copy the
> data.
> 
> Without the unrolling, it will reduce the speed since the next store
> instruction for the same register using from the load will stall the
> pipeline.
...
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index e2e57551fc76..bceb0629e440 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -67,6 +67,39 @@ ENTRY(__asm_copy_from_user)
>  	bnez	a3, .Lshift_copy
> 
>  .Lword_copy:
> +        /*
> +	 * Both src and dst are aligned, unrolled word copy
> +	 *
> +	 * a0 - start of aligned dst
> +	 * a1 - start of aligned src
> +	 * a3 - a1 & mask:(SZREG-1)
> +	 * t0 - end of aligned dst
> +	 */
> +	addi	t0, t0, -(8*SZREG-1) /* not to over run */
> +2:
> +	fixup REG_L   a4,        0(a1), 10f
> +	fixup REG_L   a5,    SZREG(a1), 10f
> +	fixup REG_L   a6,  2*SZREG(a1), 10f
> +	fixup REG_L   a7,  3*SZREG(a1), 10f
> +	fixup REG_L   t1,  4*SZREG(a1), 10f
> +	fixup REG_L   t2,  5*SZREG(a1), 10f
> +	fixup REG_L   t3,  6*SZREG(a1), 10f
> +	fixup REG_L   t4,  7*SZREG(a1), 10f
> +	fixup REG_S   a4,        0(a0), 10f
> +	fixup REG_S   a5,    SZREG(a0), 10f
> +	fixup REG_S   a6,  2*SZREG(a0), 10f
> +	fixup REG_S   a7,  3*SZREG(a0), 10f
> +	fixup REG_S   t1,  4*SZREG(a0), 10f
> +	fixup REG_S   t2,  5*SZREG(a0), 10f
> +	fixup REG_S   t3,  6*SZREG(a0), 10f
> +	fixup REG_S   t4,  7*SZREG(a0), 10f
> +	addi	a0, a0, 8*SZREG
> +	addi	a1, a1, 8*SZREG
> +	bltu	a0, t0, 2b
> +
> +	addi	t0, t0, 8*SZREG-1 /* revert to original value */
> +	j	.Lbyte_copy_tail
> +

Are there any riscv chips that can do a memory read and a
memory write in the same cycle but don't have significant
'out of order' execution?

Such chips will execute that code very badly.
Or, rather, there are loops that allow concurrent read+write
that will be a lot faster.

Also, on a cpu that can execute a memory read/write
at the same time as an add (probably anything superscalar)
you want to move the two 'addi' further up so they get
executed 'for free'.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code
  2021-06-21 11:45   ` David Laight
@ 2021-06-21 13:55     ` Akira Tsukamoto
  0 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-21 13:55 UTC (permalink / raw)
  To: David Laight, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv
  Cc: akira.tsukamoto

On 6/21/2021 8:45 PM, David Laight wrote:
> From: Akira Tsukamoto
>> Sent: 19 June 2021 12:35
>>
>> This is to make the diff easier to read, since the diff on
>> assembler is horrible to read.
> 
> You can't do that, it breaks bisection.

I know; it is intentional. I explained it in the other thread
with Ben Dooks.
Right now I am just focusing on making it easier to understand
what the code does.

Akira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned
  2021-06-21 11:55   ` David Laight
@ 2021-06-21 14:13     ` Akira Tsukamoto
  0 siblings, 0 replies; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-21 14:13 UTC (permalink / raw)
  To: David Laight, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv
  Cc: akira.tsukamoto

On 6/21/2021 8:55 PM, David Laight wrote:
> From: Akira Tsukamoto
>> Sent: 19 June 2021 12:43
>>
>> In the lucky situation that the both source and destination address are on
>> the aligned boundary, perform load and store with register size to copy the
>> data.
>>
>> Without the unrolling, it will reduce the speed since the next store
>> instruction for the same register using from the load will stall the
>> pipeline.
> ...
>> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
>> index e2e57551fc76..bceb0629e440 100644
>> --- a/arch/riscv/lib/uaccess.S
>> +++ b/arch/riscv/lib/uaccess.S
>> @@ -67,6 +67,39 @@ ENTRY(__asm_copy_from_user)
>>  	bnez	a3, .Lshift_copy
>>
>>  .Lword_copy:
>> +        /*
>> +	 * Both src and dst are aligned, unrolled word copy
>> +	 *
>> +	 * a0 - start of aligned dst
>> +	 * a1 - start of aligned src
>> +	 * a3 - a1 & mask:(SZREG-1)
>> +	 * t0 - end of aligned dst
>> +	 */
>> +	addi	t0, t0, -(8*SZREG-1) /* not to over run */
>> +2:
>> +	fixup REG_L   a4,        0(a1), 10f
>> +	fixup REG_L   a5,    SZREG(a1), 10f
>> +	fixup REG_L   a6,  2*SZREG(a1), 10f
>> +	fixup REG_L   a7,  3*SZREG(a1), 10f
>> +	fixup REG_L   t1,  4*SZREG(a1), 10f
>> +	fixup REG_L   t2,  5*SZREG(a1), 10f
>> +	fixup REG_L   t3,  6*SZREG(a1), 10f
>> +	fixup REG_L   t4,  7*SZREG(a1), 10f
>> +	fixup REG_S   a4,        0(a0), 10f
>> +	fixup REG_S   a5,    SZREG(a0), 10f
>> +	fixup REG_S   a6,  2*SZREG(a0), 10f
>> +	fixup REG_S   a7,  3*SZREG(a0), 10f
>> +	fixup REG_S   t1,  4*SZREG(a0), 10f
>> +	fixup REG_S   t2,  5*SZREG(a0), 10f
>> +	fixup REG_S   t3,  6*SZREG(a0), 10f
>> +	fixup REG_S   t4,  7*SZREG(a0), 10f
>> +	addi	a0, a0, 8*SZREG
>> +	addi	a1, a1, 8*SZREG
>> +	bltu	a0, t0, 2b
>> +
>> +	addi	t0, t0, 8*SZREG-1 /* revert to original value */
>> +	j	.Lbyte_copy_tail
>> +
> 
> Are there any riscv chips than can do a memory read and a
> memory write int the same cycle but don't have significant
> 'out of order' execution?
> 
> Such chips will execute that code very badly.
> Or, rather, there are loops that allow concurrent read+write
> that will be a lot faster.

For the above two paragraphs, BOOM will probably be one of
them, and perhaps the U8, but I have not had a chance to try them.

I have run the benchmarks with both the unrolled and the
non-unrolled load/store loops, and the unrolled version was always
faster on current cores. We can discuss further optimization when
out-of-order cores come out in the market, comparing benchmark
results on real hardware.

I do understand your comments about concurrent read+write,
which you also mentioned in the other thread.

I just would like to make current risc-v better
as soon as possible, since the difference is significant.

> 
> Also on a cpu that can execute a memory read/write
> at the same time as an add (probably anything supercaler)
> you want to move the two 'addi' further up so they get
> executed 'for free'.

The original assembler version of memcpy does have the `addi`
moved up a few lines.
You really know the internals. I am torn between keeping the code
easy to understand, to get the patches upstream, and optimizing it
further.

If you would like, I will move the `addi` up when merging
the patches into one that does not break bisecting.

Akira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
                   ` (5 preceding siblings ...)
  2021-06-20 10:02 ` [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Ben Dooks
@ 2021-06-22  8:30 ` Ben Dooks
  2021-06-22 12:05   ` Akira Tsukamoto
  2021-07-12 21:24 ` Ben Dooks
  7 siblings, 1 reply; 16+ messages in thread
From: Ben Dooks @ 2021-06-22  8:30 UTC (permalink / raw)
  To: Akira Tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv

On 19/06/2021 12:21, Akira Tsukamoto wrote:
> Optimizing copy_to_user and copy_from_user.
> 
> I rewrote the functions in v2, heavily influenced by Gary's memcpy
> function [1].
> The functions must be written in assembler to handle page faults manually
> inside the function.
> 
> With the changes, the CPU usage percentage improves, as does some of the
> network throughput for UDP packets.
> Only copy_user is patched; the original memcpy is kept as-is.
> 
> All results are from the same base kernel, same rootfs and same
> BeagleV beta board.
> 
> Comparison by "perf top -Ue task-clock" while running iperf3.

I did a quick test on a SiFive Unmatched with IO to an NVME.

before: cached-reads=172.47MB/sec, buffered-reads=135.8MB/sec
with-patch: cached-reads=177.54MB/sec, buffered-reads=137.79MB/sec

That was just one test run, so there was a small improvement. I am
sort of surprised we didn't get more of a win from this.

perf record on hdparm shows that it spends approx 15% cpu time in
asm_copy_to_user. Does anyone have a benchmark for this which just
looks at copy/to user? if not should we create one?

-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-22  8:30 ` Ben Dooks
@ 2021-06-22 12:05   ` Akira Tsukamoto
  2021-06-22 17:45     ` Ben Dooks
  0 siblings, 1 reply; 16+ messages in thread
From: Akira Tsukamoto @ 2021-06-22 12:05 UTC (permalink / raw)
  To: Ben Dooks, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv

On 6/22/2021 5:30 PM, Ben Dooks wrote:
> On 19/06/2021 12:21, Akira Tsukamoto wrote:
>> Optimizing copy_to_user and copy_from_user.
>>
>> I rewrote the functions in v2, heavily influenced by Gary's memcpy
>> function [1].
>> The functions must be written in assembler to handle page faults manually
>> inside the function.
>>
>> With the changes, the CPU usage percentage improves, as does some of the
>> network throughput for UDP packets.
>> Only copy_user is patched; the original memcpy is kept as-is.
>>
>> All results are from the same base kernel, same rootfs and same
>> BeagleV beta board.
>>
>> Comparison by "perf top -Ue task-clock" while running iperf3.
> 
> I did a quick test on a SiFive Unmatched with IO to an NVME.
> 
> before: cached-reads=172.47MB/sec, buffered-reads=135.8MB/sec
> with-patch: cached-reads=177.54MB/sec, buffered-reads=137.79MB/sec
> 
> That was just one test run, so there was a small improvement. I am
> sort of surprised we didn't get more of a win from this.
> 
> perf record on hdparm shows that it spends approx 15% cpu time in
> asm_copy_to_user. Does anyone have a benchmark for this which just
> looks at copy/to user? if not should we create one?

Thanks for the result on the Unmatched with hdparm. Have you tried
iperf3?

The 15% is high; is that before or with the patch?

Akira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-22 12:05   ` Akira Tsukamoto
@ 2021-06-22 17:45     ` Ben Dooks
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Dooks @ 2021-06-22 17:45 UTC (permalink / raw)
  To: Akira Tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv

On 22/06/2021 13:05, Akira Tsukamoto wrote:
> On 6/22/2021 5:30 PM, Ben Dooks wrote:
>> On 19/06/2021 12:21, Akira Tsukamoto wrote:
>>> Optimizing copy_to_user and copy_from_user.
>>>
>>> I rewrote the functions in v2, heavily influenced by Gary's memcpy
>>> function [1].
>>> The functions must be written in assembler to handle page faults manually
>>> inside the function.
>>>
>>> With the changes, the CPU usage percentage improves, as does some of the
>>> network throughput for UDP packets.
>>> Only copy_user is patched; the original memcpy is kept as-is.
>>>
>>> All results are from the same base kernel, same rootfs and same
>>> BeagleV beta board.
>>>
>>> Comparison by "perf top -Ue task-clock" while running iperf3.
>>
>> I did a quick test on a SiFive Unmatched with IO to an NVME.
>>
>> before: cached-reads=172.47MB/sec, buffered-reads=135.8MB/sec
>> with-patch: cached-reads=177.54MB/sec, buffered-reads=137.79MB/sec
>>
>> That was just one test run, so there was a small improvement. I am
>> sort of surprised we didn't get more of a win from this.
>>
>> perf record on hdparm shows that it spends approx 15% cpu time in
>> asm_copy_to_user. Does anyone have a benchmark for this which just
>> looks at copy/to user? if not should we create one?
> 
> Thanks for the result on the Unmatched with hdparm. Have you tried
> iperf3?

I will see if there is iperf3 installed. I've not done much other than
try booting it and then try booting it with a kernel I've built from
upstream.

> The 15% is high, is it before or with-patch?

Can't remember, I did this more to find out if the copy to/from user
was going to show up in the times for hdparm.

> Akira
> 


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
  2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
                   ` (6 preceding siblings ...)
  2021-06-22  8:30 ` Ben Dooks
@ 2021-07-12 21:24 ` Ben Dooks
  7 siblings, 0 replies; 16+ messages in thread
From: Ben Dooks @ 2021-07-12 21:24 UTC (permalink / raw)
  To: Akira Tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-kernel, linux-riscv

On 19/06/2021 12:21, Akira Tsukamoto wrote:
> Optimizing copy_to_user and copy_from_user.
> 
> I rewrote the functions in v2, heavily influenced by Gary's memcpy
> function [1].
> The functions must be written in assembler to handle page faults manually
> inside the function.
> 
> With the changes, the CPU usage percentage improves, as does some of the
> network throughput for UDP packets.
> Only copy_user is patched; the original memcpy is kept as-is.
> 
> All results are from the same base kernel, same rootfs and same
> BeagleV beta board.
> 
> Comparison by "perf top -Ue task-clock" while running iperf3.
> 
> --- TCP recv ---
>   * Before
>    40.40%  [kernel]  [k] memcpy
>    33.09%  [kernel]  [k] __asm_copy_to_user
>   * After
>    50.35%  [kernel]  [k] memcpy
>    13.76%  [kernel]  [k] __asm_copy_to_user
> 
> --- TCP send ---
>   * Before
>    19.96%  [kernel]  [k] memcpy
>     9.84%  [kernel]  [k] __asm_copy_to_user
>   * After
>    14.27%  [kernel]  [k] memcpy
>     7.37%  [kernel]  [k] __asm_copy_to_user
> 
> --- UDP send ---
>   * Before
>    25.18%  [kernel]  [k] memcpy
>    22.50%  [kernel]  [k] __asm_copy_to_user
>   * After
>    28.90%  [kernel]  [k] memcpy
>     9.49%  [kernel]  [k] __asm_copy_to_user
> 
> --- UDP recv ---
>   * Before
>    44.45%  [kernel]  [k] memcpy
>    31.04%  [kernel]  [k] __asm_copy_to_user
>   * After
>    55.62%  [kernel]  [k] memcpy
>    11.22%  [kernel]  [k] __asm_copy_to_user
> 
> Processing network packets requires many unaligned accesses for the packet
> headers, and the header format cannot be redesigned to be aligned.
> User applications also call send/recv() and sendto/recvfrom() with large
> buffers so that fewer system calls are needed.
> 
> v1 -> v2:
> - Added shift copy
> - Separated patches for readability of changes in assembler
> - Using perf results
> 
> [1] https://lkml.org/lkml/2021/2/16/778
> 
> Akira Tsukamoto (5):
>    riscv: __asm_to/copy_from_user: delete existing code
>    riscv: __asm_to/copy_from_user: Adding byte copy first
>    riscv: __asm_to/copy_from_user: Copy until dst is aligned address
>    riscv: __asm_to/copy_from_user: Bulk copy while shifting misaligned
>      data
>    riscv: __asm_to/copy_from_user: Bulk copy when both src dst are
>      aligned
> 
>   arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
>   1 file changed, 146 insertions(+), 35 deletions(-)
> 

I'm doing some work to allow benchmarking and testing of the uaccess code.

So far the initial results are:

> 
> copy routine 1: original
> test 1: copier 1: offset 0, size 8192: took 43.394000 ms, 1800.364106 MiB/sec
> test 1: copier 1: offset 1, size 8191: took 343.767000 ms, 227.233746 MiB/sec
> test 1: copier 1: offset 2, size 8190: took 343.727000 ms, 227.232445 MiB/sec
> test 1: copier 1: offset 3, size 8189: took 343.664000 ms, 227.246350 MiB/sec
> test 1: copier 1: offset 4, size 8188: took 343.751000 ms, 227.161093 MiB/sec
> test 1: copier 1: offset 5, size 8187: took 343.620000 ms, 227.219941 MiB/sec
> test 1: copier 1: offset 6, size 8186: took 343.540000 ms, 227.245094 MiB/sec
> test 1: copier 1: offset 7, size 8185: took 343.640000 ms, 227.151213 MiB/sec
> copy routine 2: new
> test 1: copier 2: offset 0, size 8192: took 18.819000 ms, 4151.389553 MiB/sec
> test 1: copier 2: offset 1, size 8191: took 43.770000 ms, 1784.680449 MiB/sec
> test 1: copier 2: offset 2, size 8190: took 43.727000 ms, 1786.217360 MiB/sec
> test 1: copier 2: offset 3, size 8189: took 43.679000 ms, 1787.961944 MiB/sec
> test 1: copier 2: offset 4, size 8188: took 43.620000 ms, 1790.161693 MiB/sec
> test 1: copier 2: offset 5, size 8187: took 43.577000 ms, 1791.709303 MiB/sec
> test 1: copier 2: offset 6, size 8186: took 43.533000 ms, 1793.301163 MiB/sec
> test 1: copier 2: offset 7, size 8185: took 43.471000 ms, 1795.639456 MiB/sec
> write tests:
> copy routine 1: original
> test 2: copier 1: offset 0, size 8192: took 43.443000 ms, 1798.333448 MiB/sec
> test 2: copier 1: offset 1, size 8191: took 344.281000 ms, 226.894494 MiB/sec
> test 2: copier 1: offset 2, size 8190: took 343.788000 ms, 227.192126 MiB/sec
> test 2: copier 1: offset 3, size 8189: took 343.735000 ms, 227.199412 MiB/sec
> test 2: copier 1: offset 4, size 8188: took 343.695000 ms, 227.198106 MiB/sec
> test 2: copier 1: offset 5, size 8187: took 343.626000 ms, 227.215974 MiB/sec
> test 2: copier 1: offset 6, size 8186: took 343.597000 ms, 227.207396 MiB/sec
> test 2: copier 1: offset 7, size 8185: took 343.823000 ms, 227.030312 MiB/sec
> copy routine 2: new
> test 2: copier 2: offset 0, size 8192: took 18.999000 ms, 4112.058529 MiB/sec
> test 2: copier 2: offset 1, size 8191: took 43.897000 ms, 1779.517125 MiB/sec
> test 2: copier 2: offset 2, size 8190: took 43.784000 ms, 1783.891981 MiB/sec
> test 2: copier 2: offset 3, size 8189: took 43.803000 ms, 1782.900481 MiB/sec
> test 2: copier 2: offset 4, size 8188: took 43.768000 ms, 1784.108322 MiB/sec
> test 2: copier 2: offset 5, size 8187: took 43.739000 ms, 1785.073191 MiB/sec
> test 2: copier 2: offset 6, size 8186: took 43.620000 ms, 1789.724428 MiB/sec
> test 2: copier 2: offset 7, size 8185: took 43.573000 ms, 1791.436045 MiB/sec
> read tests:
> copy routine 1: original
> test 1: copier 1: offset 0, size 16384: took 87.173000 ms, 1792.412788 MiB/sec
> test 1: copier 1: offset 1, size 16383: took 689.480000 ms, 226.606230 MiB/sec
> test 1: copier 1: offset 2, size 16382: took 689.251000 ms, 226.667682 MiB/sec
> test 1: copier 1: offset 3, size 16381: took 689.203000 ms, 226.669631 MiB/sec
> test 1: copier 1: offset 4, size 16380: took 689.385000 ms, 226.595956 MiB/sec
> test 1: copier 1: offset 5, size 16379: took 689.201000 ms, 226.642614 MiB/sec
> test 1: copier 1: offset 6, size 16378: took 689.158000 ms, 226.642917 MiB/sec
> test 1: copier 1: offset 7, size 16377: took 689.038000 ms, 226.668548 MiB/sec
> copy routine 2: new
> test 1: copier 2: offset 0, size 16384: took 38.825000 ms, 4024.468770 MiB/sec
> test 1: copier 2: offset 1, size 16383: took 88.706000 ms, 1761.329146 MiB/sec
> test 1: copier 2: offset 2, size 16382: took 88.663000 ms, 1762.075798 MiB/sec
> test 1: copier 2: offset 3, size 16381: took 88.614000 ms, 1762.942535 MiB/sec
> test 1: copier 2: offset 4, size 16380: took 88.592000 ms, 1763.272677 MiB/sec
> test 1: copier 2: offset 5, size 16379: took 88.518000 ms, 1764.639014 MiB/sec
> test 1: copier 2: offset 6, size 16378: took 88.481000 ms, 1765.269149 MiB/sec
> test 1: copier 2: offset 7, size 16377: took 88.437000 ms, 1766.039585 MiB/sec
> write tests:
> copy routine 1: original
> test 2: copier 1: offset 0, size 16384: took 87.150000 ms, 1792.885829 MiB/sec
> test 2: copier 1: offset 1, size 16383: took 689.470000 ms, 226.609516 MiB/sec
> test 2: copier 1: offset 2, size 16382: took 689.242000 ms, 226.670642 MiB/sec
> test 2: copier 1: offset 3, size 16381: took 689.165000 ms, 226.682129 MiB/sec
> test 2: copier 1: offset 4, size 16380: took 689.697000 ms, 226.493450 MiB/sec
> test 2: copier 1: offset 5, size 16379: took 689.070000 ms, 226.685701 MiB/sec
> test 2: copier 1: offset 6, size 16378: took 689.018000 ms, 226.688968 MiB/sec
> test 2: copier 1: offset 7, size 16377: took 689.009000 ms, 226.678088 MiB/sec
> copy routine 2: new
> test 2: copier 2: offset 0, size 16384: took 38.871000 ms, 4019.706208 MiB/sec
> test 2: copier 2: offset 1, size 16383: took 88.732000 ms, 1760.813047 MiB/sec
> test 2: copier 2: offset 2, size 16382: took 88.672000 ms, 1761.896952 MiB/sec
> test 2: copier 2: offset 3, size 16381: took 88.642000 ms, 1762.385661 MiB/sec
> test 2: copier 2: offset 4, size 16380: took 88.730000 ms, 1760.530294 MiB/sec
> test 2: copier 2: offset 5, size 16379: took 88.670000 ms, 1761.614033 MiB/sec
> test 2: copier 2: offset 6, size 16378: took 88.627000 ms, 1762.361126 MiB/sec
> test 2: copier 2: offset 7, size 16377: took 88.543000 ms, 1763.925356 MiB/sec
> read tests:
> copy routine 1: original
> test 1: copier 1: offset 0, size 32768: took 243.592000 ms, 1282.882853 MiB/sec
> test 1: copier 1: offset 1, size 32767: took 1426.538000 ms, 219.055127 MiB/sec
> test 1: copier 1: offset 2, size 32766: took 1426.340000 ms, 219.078850 MiB/sec
> test 1: copier 1: offset 3, size 32765: took 1426.297000 ms, 219.078768 MiB/sec
> test 1: copier 1: offset 4, size 32764: took 1426.069000 ms, 219.107107 MiB/sec
> test 1: copier 1: offset 5, size 32763: took 1425.970000 ms, 219.115631 MiB/sec
> test 1: copier 1: offset 6, size 32762: took 1425.975000 ms, 219.108175 MiB/sec
> test 1: copier 1: offset 7, size 32761: took 1425.906000 ms, 219.112089 MiB/sec
> copy routine 2: new
> test 1: copier 2: offset 0, size 32768: took 205.966000 ms, 1517.240710 MiB/sec
> test 1: copier 2: offset 1, size 32767: took 304.295000 ms, 1026.932625 MiB/sec
> test 1: copier 2: offset 2, size 32766: took 304.219000 ms, 1027.157825 MiB/sec
> test 1: copier 2: offset 3, size 32765: took 304.114000 ms, 1027.481108 MiB/sec
> test 1: copier 2: offset 4, size 32764: took 304.102000 ms, 1027.490293 MiB/sec
> test 1: copier 2: offset 5, size 32763: took 304.032000 ms, 1027.695494 MiB/sec
> test 1: copier 2: offset 6, size 32762: took 304.012000 ms, 1027.731733 MiB/sec
> test 1: copier 2: offset 7, size 32761: took 304.250000 ms, 1026.896443 MiB/sec
> write tests:
> copy routine 1: original
> test 2: copier 1: offset 0, size 32768: took 269.605000 ms, 1159.103132 MiB/sec
> test 2: copier 1: offset 1, size 32767: took 1438.271000 ms, 217.268139 MiB/sec
> test 2: copier 1: offset 2, size 32766: took 1438.197000 ms, 217.272687 MiB/sec
> test 2: copier 1: offset 3, size 32765: took 1438.157000 ms, 217.272099 MiB/sec
> test 2: copier 1: offset 4, size 32764: took 1438.121000 ms, 217.270906 MiB/sec
> test 2: copier 1: offset 5, size 32763: took 1438.085000 ms, 217.269714 MiB/sec
> test 2: copier 1: offset 6, size 32762: took 1438.012000 ms, 217.274111 MiB/sec
> test 2: copier 1: offset 7, size 32761: took 1437.998000 ms, 217.269595 MiB/sec
> copy routine 2: new
> test 2: copier 2: offset 0, size 32768: took 237.597000 ms, 1315.252297 MiB/sec
> test 2: copier 2: offset 1, size 32767: took 340.638000 ms, 917.368183 MiB/sec
> test 2: copier 2: offset 2, size 32766: took 340.669000 ms, 917.256711 MiB/sec
> test 2: copier 2: offset 3, size 32765: took 340.615000 ms, 917.374131 MiB/sec
> test 2: copier 2: offset 4, size 32764: took 340.542000 ms, 917.542779 MiB/sec
> test 2: copier 2: offset 5, size 32763: took 340.543000 ms, 917.512080 MiB/sec
> test 2: copier 2: offset 6, size 32762: took 340.775000 ms, 916.859451 MiB/sec
> test 2: copier 2: offset 7, size 32761: took 343.885000 ms, 908.539898 MiB/sec

It looks like the new code is about 2.2 times faster for the aligned
tests and 7.8 times faster for the unaligned tests. I'll try and get
this published some time this week.


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-12 21:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-19 11:21 [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Akira Tsukamoto
2021-06-19 11:34 ` [PATCH 1/5] riscv: __asm_to/copy_from_user: delete existing code Akira Tsukamoto
2021-06-21 11:45   ` David Laight
2021-06-21 13:55     ` Akira Tsukamoto
2021-06-19 11:35 ` [PATCH 2/5] riscv: __asm_to/copy_from_user: Adding byte copy first Akira Tsukamoto
2021-06-19 11:36 ` [PATCH 3/5] riscv: __asm_to/copy_from_user: Copy until dst is aligned Akira Tsukamoto
2021-06-19 11:37 ` [PATCH 4/5] riscv: __asm_to/copy_from_user: Bulk copy while shifting Akira Tsukamoto
2021-06-19 11:43 ` [PATCH 5/5] riscv: __asm_to/copy_from_user: Bulk copy when both src, dst are aligned Akira Tsukamoto
2021-06-21 11:55   ` David Laight
2021-06-21 14:13     ` Akira Tsukamoto
2021-06-20 10:02 ` [PATCH v2 0/5] riscv: improving uaccess with logs from network bench Ben Dooks
2021-06-20 16:36   ` Akira Tsukamoto
2021-06-22  8:30 ` Ben Dooks
2021-06-22 12:05   ` Akira Tsukamoto
2021-06-22 17:45     ` Ben Dooks
2021-07-12 21:24 ` Ben Dooks
