[PATCH v3 0/1] riscv: improving uaccess with logs from network bench

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/1] riscv: improving uaccess with logs from network bench
@ 2021-06-23 12:37 Akira Tsukamoto
  2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto
  0 siblings, 1 reply; 7+ messages in thread
From: Akira Tsukamoto @ 2021-06-23 12:37 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Akira Tsukamoto,
	linux-riscv, linux-kernel

Optimizing copy_to_user and copy_from_user.

I rewrote the functions but heavily influenced by Garry's memcpy
function [1]. It must be written in assembler to handle page faults
manually inside the function unlike other memcpy functions.

This patch will reduce cpu usage dramatically in kernel space especially
for applications which use sys-call with large buffer size, such as network
applications. The main reason behind this is that every unaligned memory
access will raise exceptions and switch between s-mode and m-mode causing
large overhead.

The motivation to create the patch was to improve network performance on
beaglev beta board. By observing with perf, the memcpy and
__asm_copy_to_user had heavy cpu usage and the network speed was limited
at around 680Mbps on 1Gbps lan. Matteo is creating the patches to improve
memcpy in C and this patch is meant to be used with them.

Typical network applications use system calls with a large buffer on
send/recv() and sendto/recvfrom() for the optimization.

The bench result, when patching only copy_user. The memcpy is without
Matteo's patches but listing the both since they are the top two largest
overhead.

All results are from the same base kernel, same rootfs and same BeagleV
beta board.

Results of iperf3 have speedup on UDP with the copy_user patch alone.

--- UDP send ---
306 Mbits/sec      362 Mbits/sec
305 Mbits/sec      362 Mbits/sec

--- UDP recv ---
772 Mbits/sec      787 Mbits/sec
773 Mbits/sec      784 Mbits/sec

Comparison by "perf top -Ue task-clock" while running iperf3.

--- TCP recv ---
 * Before
  40.40%  [kernel]  [k] memcpy
  33.09%  [kernel]  [k] __asm_copy_to_user
 * With patch
  50.35%  [kernel]  [k] memcpy
  13.76%  [kernel]  [k] __asm_copy_to_user

--- TCP send ---
 * Before
  19.96%  [kernel]  [k] memcpy
   9.84%  [kernel]  [k] __asm_copy_to_user
 * With patch
  14.27%  [kernel]  [k] memcpy
   7.37%  [kernel]  [k] __asm_copy_to_user

--- UDP recv ---
 * Before
  44.45%  [kernel]  [k] memcpy
  31.04%  [kernel]  [k] __asm_copy_to_user
 * With patch
  55.62%  [kernel]  [k] memcpy
  11.22%  [kernel]  [k] __asm_copy_to_user

--- UDP send ---
 * Before
  25.18%  [kernel]  [k] memcpy
  22.50%  [kernel]  [k] __asm_copy_to_user
 * With patch
  28.90%  [kernel]  [k] memcpy
   9.49%  [kernel]  [k] __asm_copy_to_user

---
v2 -> v3:
- Merged all patches

v1 -> v2:
- Added shift copy
- Separated patches for readability of changes in assembler
- Using perf results

[1] https://lkml.org/lkml/2021/2/16/778

Akira Tsukamoto (1):
  riscv: __asm_copy_to-from_user: Optimize unaligned memory access and
    pipeline stall

 arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
 1 file changed, 146 insertions(+), 35 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-06-23 12:37 [PATCH v3 0/1] riscv: improving uaccess with logs from network bench Akira Tsukamoto
@ 2021-06-23 12:40 ` Akira Tsukamoto
  2021-07-06 23:16   ` Palmer Dabbelt
  2021-07-10  1:49   ` Guenter Roeck
  0 siblings, 2 replies; 7+ messages in thread
From: Akira Tsukamoto @ 2021-06-23 12:40 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel


This patch will reduce cpu usage dramatically in kernel space especially
for application which use sys-call with large buffer size, such as network
applications. The main reason behind this is that every unaligned memory
access will raise exceptions and switch between s-mode and m-mode causing
large overhead.

First copy in bytes until reaches the first word aligned boundary in
destination memory address. This is the preparation before the bulk
aligned word copy.

The destination address is aligned now, but oftentimes the source address
is not in an aligned boundary. To reduce the unaligned memory access, it
reads the data from source in aligned boundaries, which will cause the
data to have an offset, and then combines the data in the next iteration
by fixing offset with shifting before writing to destination. The majority
of the improving copy speed comes from this shift copy.

In the lucky situation that the both source and destination address are on
the aligned boundary, perform load and store with register size to copy the
data. Without the unrolling, it will reduce the speed since the next store
instruction for the same register using from the load will stall the
pipeline.

At last, copying the remainder in one byte at a time.

Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
 arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
 1 file changed, 146 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index fceaeb18cc64..bceb0629e440 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user)
 	li t6, SR_SUM
 	csrs CSR_STATUS, t6
 
-	add a3, a1, a2
-	/* Use word-oriented copy only if low-order bits match */
-	andi t0, a0, SZREG-1
-	andi t1, a1, SZREG-1
-	bne t0, t1, 2f
+	/* Save for return value */
+	mv	t5, a2
 
-	addi t0, a1, SZREG-1
-	andi t1, a3, ~(SZREG-1)
-	andi t0, t0, ~(SZREG-1)
 	/*
-	 * a3: terminal address of source region
-	 * t0: lowest XLEN-aligned address in source
-	 * t1: highest XLEN-aligned address in source
+	 * Register allocation for code below:
+	 * a0 - start of uncopied dst
+	 * a1 - start of uncopied src
+	 * a2 - size
+	 * t0 - end of uncopied dst
 	 */
-	bgeu t0, t1, 2f
-	bltu a1, t0, 4f
+	add	t0, a0, a2
+	bgtu	a0, t0, 5f
+
+	/*
+	 * Use byte copy only if too small.
+	 */
+	li	a3, 8*SZREG /* size must be larger than size in word_copy */
+	bltu	a2, a3, .Lbyte_copy_tail
+
+	/*
+	 * Copy first bytes until dst is align to word boundary.
+	 * a0 - start of dst
+	 * t1 - start of aligned dst
+	 */
+	addi	t1, a0, SZREG-1
+	andi	t1, t1, ~(SZREG-1)
+	/* dst is already aligned, skip */
+	beq	a0, t1, .Lskip_first_bytes
 1:
-	fixup REG_L, t2, (a1), 10f
-	fixup REG_S, t2, (a0), 10f
-	addi a1, a1, SZREG
-	addi a0, a0, SZREG
-	bltu a1, t1, 1b
+	/* a5 - one byte for copying data */
+	fixup lb      a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb      a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t1, 1b	/* t1 - start of aligned dst */
+
+.Lskip_first_bytes:
+	/*
+	 * Now dst is aligned.
+	 * Use shift-copy if src is misaligned.
+	 * Use word-copy if both src and dst are aligned because
+	 * can not use shift-copy which do not require shifting
+	 */
+	/* a1 - start of src */
+	andi	a3, a1, SZREG-1
+	bnez	a3, .Lshift_copy
+
+.Lword_copy:
+        /*
+	 * Both src and dst are aligned, unrolled word copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of aligned dst
+	 */
+	addi	t0, t0, -(8*SZREG-1) /* not to over run */
 2:
-	bltu a1, a3, 5f
+	fixup REG_L   a4,        0(a1), 10f
+	fixup REG_L   a5,    SZREG(a1), 10f
+	fixup REG_L   a6,  2*SZREG(a1), 10f
+	fixup REG_L   a7,  3*SZREG(a1), 10f
+	fixup REG_L   t1,  4*SZREG(a1), 10f
+	fixup REG_L   t2,  5*SZREG(a1), 10f
+	fixup REG_L   t3,  6*SZREG(a1), 10f
+	fixup REG_L   t4,  7*SZREG(a1), 10f
+	fixup REG_S   a4,        0(a0), 10f
+	fixup REG_S   a5,    SZREG(a0), 10f
+	fixup REG_S   a6,  2*SZREG(a0), 10f
+	fixup REG_S   a7,  3*SZREG(a0), 10f
+	fixup REG_S   t1,  4*SZREG(a0), 10f
+	fixup REG_S   t2,  5*SZREG(a0), 10f
+	fixup REG_S   t3,  6*SZREG(a0), 10f
+	fixup REG_S   t4,  7*SZREG(a0), 10f
+	addi	a0, a0, 8*SZREG
+	addi	a1, a1, 8*SZREG
+	bltu	a0, t0, 2b
+
+	addi	t0, t0, 8*SZREG-1 /* revert to original value */
+	j	.Lbyte_copy_tail
+
+.Lshift_copy:
+
+	/*
+	 * Word copy with shifting.
+	 * For misaligned copy we still perform aligned word copy, but
+	 * we need to use the value fetched from the previous iteration and
+	 * do some shifts.
+	 * This is safe because reading less than a word size.
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of uncopied dst
+	 * t1 - end of aligned dst
+	 */
+	/* calculating aligned word boundary for dst */
+	andi	t1, t0, ~(SZREG-1)
+	/* Converting unaligned src to aligned arc */
+	andi	a1, a1, ~(SZREG-1)
+
+	/*
+	 * Calculate shifts
+	 * t3 - prev shift
+	 * t4 - current shift
+	 */
+	slli	t3, a3, LGREG
+	li	a5, SZREG*8
+	sub	t4, a5, t3
+
+	/* Load the first word to combine with seceond word */
+	fixup REG_L   a5, 0(a1), 10f
 
 3:
+	/* Main shifting copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * t1 - end of aligned dst
+	 */
+
+	/* At least one iteration will be executed */
+	srl	a4, a5, t3
+	fixup REG_L   a5, SZREG(a1), 10f
+	addi	a1, a1, SZREG
+	sll	a2, a5, t4
+	or	a2, a2, a4
+	fixup REG_S   a2, 0(a0), 10f
+	addi	a0, a0, SZREG
+	bltu	a0, t1, 3b
+
+	/* Revert src to original unaligned value  */
+	add	a1, a1, a3
+
+.Lbyte_copy_tail:
+	/*
+	 * Byte copy anything left.
+	 *
+	 * a0 - start of remaining dst
+	 * a1 - start of remaining src
+	 * t0 - end of remaining dst
+	 */
+	bgeu	a0, t0, 5f
+4:
+	fixup lb      a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb      a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t0, 4b	/* t0 - end of dst */
+
+5:
 	/* Disable access to user memory */
 	csrc CSR_STATUS, t6
-	li a0, 0
+	li	a0, 0
 	ret
-4: /* Edge case: unalignment */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, t0, 4b
-	j 1b
-5: /* Edge case: remainder */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, a3, 5b
-	j 3b
 ENDPROC(__asm_copy_to_user)
 ENDPROC(__asm_copy_from_user)
 EXPORT_SYMBOL(__asm_copy_to_user)
@@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user)
 10:
 	/* Disable access to user memory */
 	csrs CSR_STATUS, t6
-	mv a0, a2
+	mv a0, t5
 	ret
 11:
 	csrs CSR_STATUS, t6
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto
@ 2021-07-06 23:16   ` Palmer Dabbelt
  2021-07-07 10:07     ` David Laight
  2021-07-10  1:49   ` Guenter Roeck
  1 sibling, 1 reply; 7+ messages in thread
From: Palmer Dabbelt @ 2021-07-06 23:16 UTC (permalink / raw)
  To: akira.tsukamoto; +Cc: Paul Walmsley, aou, linux-riscv, linux-kernel

On Wed, 23 Jun 2021 05:40:39 PDT (-0700), akira.tsukamoto@gmail.com wrote:
>
> This patch will reduce cpu usage dramatically in kernel space especially
> for application which use sys-call with large buffer size, such as network
> applications. The main reason behind this is that every unaligned memory
> access will raise exceptions and switch between s-mode and m-mode causing
> large overhead.
>
> First copy in bytes until reaches the first word aligned boundary in
> destination memory address. This is the preparation before the bulk
> aligned word copy.
>
> The destination address is aligned now, but oftentimes the source address
> is not in an aligned boundary. To reduce the unaligned memory access, it
> reads the data from source in aligned boundaries, which will cause the
> data to have an offset, and then combines the data in the next iteration
> by fixing offset with shifting before writing to destination. The majority
> of the improving copy speed comes from this shift copy.
>
> In the lucky situation that the both source and destination address are on
> the aligned boundary, perform load and store with register size to copy the
> data. Without the unrolling, it will reduce the speed since the next store
> instruction for the same register using from the load will stall the
> pipeline.
>
> At last, copying the remainder in one byte at a time.
>
> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
> ---
>  arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
>  1 file changed, 146 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index fceaeb18cc64..bceb0629e440 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user)
>  	li t6, SR_SUM
>  	csrs CSR_STATUS, t6
>
> -	add a3, a1, a2
> -	/* Use word-oriented copy only if low-order bits match */
> -	andi t0, a0, SZREG-1
> -	andi t1, a1, SZREG-1
> -	bne t0, t1, 2f
> +	/* Save for return value */
> +	mv	t5, a2
>
> -	addi t0, a1, SZREG-1
> -	andi t1, a3, ~(SZREG-1)
> -	andi t0, t0, ~(SZREG-1)
>  	/*
> -	 * a3: terminal address of source region
> -	 * t0: lowest XLEN-aligned address in source
> -	 * t1: highest XLEN-aligned address in source
> +	 * Register allocation for code below:
> +	 * a0 - start of uncopied dst
> +	 * a1 - start of uncopied src
> +	 * a2 - size
> +	 * t0 - end of uncopied dst
>  	 */
> -	bgeu t0, t1, 2f
> -	bltu a1, t0, 4f
> +	add	t0, a0, a2
> +	bgtu	a0, t0, 5f
> +
> +	/*
> +	 * Use byte copy only if too small.
> +	 */
> +	li	a3, 8*SZREG /* size must be larger than size in word_copy */
> +	bltu	a2, a3, .Lbyte_copy_tail
> +
> +	/*
> +	 * Copy first bytes until dst is align to word boundary.
> +	 * a0 - start of dst
> +	 * t1 - start of aligned dst
> +	 */
> +	addi	t1, a0, SZREG-1
> +	andi	t1, t1, ~(SZREG-1)
> +	/* dst is already aligned, skip */
> +	beq	a0, t1, .Lskip_first_bytes
>  1:
> -	fixup REG_L, t2, (a1), 10f
> -	fixup REG_S, t2, (a0), 10f
> -	addi a1, a1, SZREG
> -	addi a0, a0, SZREG
> -	bltu a1, t1, 1b
> +	/* a5 - one byte for copying data */
> +	fixup lb      a5, 0(a1), 10f
> +	addi	a1, a1, 1	/* src */
> +	fixup sb      a5, 0(a0), 10f
> +	addi	a0, a0, 1	/* dst */
> +	bltu	a0, t1, 1b	/* t1 - start of aligned dst */
> +
> +.Lskip_first_bytes:
> +	/*
> +	 * Now dst is aligned.
> +	 * Use shift-copy if src is misaligned.
> +	 * Use word-copy if both src and dst are aligned because
> +	 * can not use shift-copy which do not require shifting
> +	 */
> +	/* a1 - start of src */
> +	andi	a3, a1, SZREG-1
> +	bnez	a3, .Lshift_copy
> +
> +.Lword_copy:
> +        /*
> +	 * Both src and dst are aligned, unrolled word copy
> +	 *
> +	 * a0 - start of aligned dst
> +	 * a1 - start of aligned src
> +	 * a3 - a1 & mask:(SZREG-1)
> +	 * t0 - end of aligned dst
> +	 */
> +	addi	t0, t0, -(8*SZREG-1) /* not to over run */
>  2:
> -	bltu a1, a3, 5f
> +	fixup REG_L   a4,        0(a1), 10f
> +	fixup REG_L   a5,    SZREG(a1), 10f
> +	fixup REG_L   a6,  2*SZREG(a1), 10f
> +	fixup REG_L   a7,  3*SZREG(a1), 10f
> +	fixup REG_L   t1,  4*SZREG(a1), 10f
> +	fixup REG_L   t2,  5*SZREG(a1), 10f
> +	fixup REG_L   t3,  6*SZREG(a1), 10f
> +	fixup REG_L   t4,  7*SZREG(a1), 10f
> +	fixup REG_S   a4,        0(a0), 10f
> +	fixup REG_S   a5,    SZREG(a0), 10f
> +	fixup REG_S   a6,  2*SZREG(a0), 10f
> +	fixup REG_S   a7,  3*SZREG(a0), 10f
> +	fixup REG_S   t1,  4*SZREG(a0), 10f
> +	fixup REG_S   t2,  5*SZREG(a0), 10f
> +	fixup REG_S   t3,  6*SZREG(a0), 10f
> +	fixup REG_S   t4,  7*SZREG(a0), 10f

This seems like a suspiciously large unrolling factor, at least without 
a fallback.  My guess is that some workloads will want some smaller 
unrolling factors, but given that we run on these single-issue in-order 
processors it's probably best to have some big unrolling factors as well 
since they're pretty limited WRT integer bandwidth.

> +	addi	a0, a0, 8*SZREG
> +	addi	a1, a1, 8*SZREG
> +	bltu	a0, t0, 2b
> +
> +	addi	t0, t0, 8*SZREG-1 /* revert to original value */
> +	j	.Lbyte_copy_tail
> +
> +.Lshift_copy:
> +
> +	/*
> +	 * Word copy with shifting.
> +	 * For misaligned copy we still perform aligned word copy, but
> +	 * we need to use the value fetched from the previous iteration and
> +	 * do some shifts.
> +	 * This is safe because reading less than a word size.
> +	 *
> +	 * a0 - start of aligned dst
> +	 * a1 - start of src
> +	 * a3 - a1 & mask:(SZREG-1)
> +	 * t0 - end of uncopied dst
> +	 * t1 - end of aligned dst
> +	 */
> +	/* calculating aligned word boundary for dst */
> +	andi	t1, t0, ~(SZREG-1)
> +	/* Converting unaligned src to aligned arc */
> +	andi	a1, a1, ~(SZREG-1)
> +
> +	/*
> +	 * Calculate shifts
> +	 * t3 - prev shift
> +	 * t4 - current shift
> +	 */
> +	slli	t3, a3, LGREG
> +	li	a5, SZREG*8
> +	sub	t4, a5, t3
> +
> +	/* Load the first word to combine with seceond word */
> +	fixup REG_L   a5, 0(a1), 10f
>
>  3:
> +	/* Main shifting copy
> +	 *
> +	 * a0 - start of aligned dst
> +	 * a1 - start of aligned src
> +	 * t1 - end of aligned dst
> +	 */
> +
> +	/* At least one iteration will be executed */
> +	srl	a4, a5, t3
> +	fixup REG_L   a5, SZREG(a1), 10f
> +	addi	a1, a1, SZREG
> +	sll	a2, a5, t4
> +	or	a2, a2, a4
> +	fixup REG_S   a2, 0(a0), 10f
> +	addi	a0, a0, SZREG
> +	bltu	a0, t1, 3b
> +
> +	/* Revert src to original unaligned value  */
> +	add	a1, a1, a3
> +
> +.Lbyte_copy_tail:
> +	/*
> +	 * Byte copy anything left.
> +	 *
> +	 * a0 - start of remaining dst
> +	 * a1 - start of remaining src
> +	 * t0 - end of remaining dst
> +	 */
> +	bgeu	a0, t0, 5f
> +4:
> +	fixup lb      a5, 0(a1), 10f
> +	addi	a1, a1, 1	/* src */
> +	fixup sb      a5, 0(a0), 10f
> +	addi	a0, a0, 1	/* dst */
> +	bltu	a0, t0, 4b	/* t0 - end of dst */
> +
> +5:
>  	/* Disable access to user memory */
>  	csrc CSR_STATUS, t6
> -	li a0, 0
> +	li	a0, 0
>  	ret
> -4: /* Edge case: unalignment */
> -	fixup lbu, t2, (a1), 10f
> -	fixup sb, t2, (a0), 10f
> -	addi a1, a1, 1
> -	addi a0, a0, 1
> -	bltu a1, t0, 4b
> -	j 1b
> -5: /* Edge case: remainder */
> -	fixup lbu, t2, (a1), 10f
> -	fixup sb, t2, (a0), 10f
> -	addi a1, a1, 1
> -	addi a0, a0, 1
> -	bltu a1, a3, 5b
> -	j 3b
>  ENDPROC(__asm_copy_to_user)
>  ENDPROC(__asm_copy_from_user)
>  EXPORT_SYMBOL(__asm_copy_to_user)
> @@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user)
>  10:
>  	/* Disable access to user memory */
>  	csrs CSR_STATUS, t6
> -	mv a0, a2
> +	mv a0, t5
>  	ret
>  11:
>  	csrs CSR_STATUS, t6

That said, this is good enough for me.  If someone comes up with a case 
where the extra unrolling is an issue I'm happy to take something to fix 
it, but until then I'm fine with this as-is.  Like the string fuctions 
it's probably best to eventually put this in C, but IIRC last time I 
tried it was kind of a headache.

This is on for-next.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-07-06 23:16   ` Palmer Dabbelt
@ 2021-07-07 10:07     ` David Laight
  0 siblings, 0 replies; 7+ messages in thread
From: David Laight @ 2021-07-07 10:07 UTC (permalink / raw)
  To: 'Palmer Dabbelt', akira.tsukamoto
  Cc: Paul Walmsley, aou, linux-riscv, linux-kernel

...
> > +	fixup REG_L   a4,        0(a1), 10f
> > +	fixup REG_L   a5,    SZREG(a1), 10f
> > +	fixup REG_L   a6,  2*SZREG(a1), 10f
> > +	fixup REG_L   a7,  3*SZREG(a1), 10f
> > +	fixup REG_L   t1,  4*SZREG(a1), 10f
> > +	fixup REG_L   t2,  5*SZREG(a1), 10f
> > +	fixup REG_L   t3,  6*SZREG(a1), 10f
> > +	fixup REG_L   t4,  7*SZREG(a1), 10f
> > +	fixup REG_S   a4,        0(a0), 10f
> > +	fixup REG_S   a5,    SZREG(a0), 10f
> > +	fixup REG_S   a6,  2*SZREG(a0), 10f
> > +	fixup REG_S   a7,  3*SZREG(a0), 10f
> > +	fixup REG_S   t1,  4*SZREG(a0), 10f
> > +	fixup REG_S   t2,  5*SZREG(a0), 10f
> > +	fixup REG_S   t3,  6*SZREG(a0), 10f
> > +	fixup REG_S   t4,  7*SZREG(a0), 10f
> 
> This seems like a suspiciously large unrolling factor, at least without
> a fallback.  My guess is that some workloads will want some smaller
> unrolling factors, but given that we run on these single-issue in-order
> processors it's probably best to have some big unrolling factors as well
> since they're pretty limited WRT integer bandwidth.

But a single-issue cpu is unlikely to have an 8 clock data delay.
OTOH a cpu than can do concurrent memory read and write might
not have enough 'out of order' capability to do so with the above loop.

You may want to interleave the reads and writes - starting with
two or three reads (possibly with the extra ones outside the loop).

I don't know the microarchitectures well enough (well at all)
to know the exact pitfalls.

The very simple cpu might have the same 'issue' the Nios2 has
(another MIPS clone fpga soft cpu) where there can be a pipeline
stall between a write and read.
I doubt the non-trivial riscv have that issue though.

> > +	addi	a0, a0, 8*SZREG
> > +	addi	a1, a1, 8*SZREG
> > +	bltu	a0, t0, 2b

For a dual-issue cpu you want to move the two 'addi' higher
up the loop so that they are 'free'.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto
  2021-07-06 23:16   ` Palmer Dabbelt
@ 2021-07-10  1:49   ` Guenter Roeck
  2021-07-13 18:10     ` Geert Uytterhoeven
  1 sibling, 1 reply; 7+ messages in thread
From: Guenter Roeck @ 2021-07-10  1:49 UTC (permalink / raw)
  To: Akira Tsukamoto
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel

Hi,

On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
> This patch will reduce cpu usage dramatically in kernel space especially
> for application which use sys-call with large buffer size, such as network
> applications. The main reason behind this is that every unaligned memory
> access will raise exceptions and switch between s-mode and m-mode causing
> large overhead.
> 
> First copy in bytes until reaches the first word aligned boundary in
> destination memory address. This is the preparation before the bulk
> aligned word copy.
> 
> The destination address is aligned now, but oftentimes the source address
> is not in an aligned boundary. To reduce the unaligned memory access, it
> reads the data from source in aligned boundaries, which will cause the
> data to have an offset, and then combines the data in the next iteration
> by fixing offset with shifting before writing to destination. The majority
> of the improving copy speed comes from this shift copy.
> 
> In the lucky situation that the both source and destination address are on
> the aligned boundary, perform load and store with register size to copy the
> data. Without the unrolling, it will reduce the speed since the next store
> instruction for the same register using from the load will stall the
> pipeline.
> 
> At last, copying the remainder in one byte at a time.
> 
> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>

This patch causes all riscv32 qemu emulations to stall during boot.
The log suggests that something in kernel/user communication may be wrong.

Bad case:

Starting syslogd: OK
Starting klogd: OK
/etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
/etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
Starting network: /bin/sh: syntax error: unterminated quoted string
sed: unmatched '/'
/bin/sh: syntax error: unterminated quoted string
FAIL
/etc/init.d/S55runtest: line 48: syntax error: EOF in backquote substitution

Good case (this patch reverted):

Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Saving random seed: [   12.277714] random: dd: uninitialized urandom read (512 bytes read)
OK
Starting network: [   12.949529] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   12.951170] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
udhcpc: started, v1.33.0
udhcpc: sending discover
udhcpc: sending select for 10.0.2.15
udhcpc: lease of 10.0.2.15 obtained, lease time 86400
deleting routers
adding dns 10.0.2.3
OK
Found console ttyS0

Reverting this patch fixes the problem. Bisect log attached.

Guenter

---
# bad: [50be9417e23af5a8ac860d998e1e3f06b8fd79d7] Merge tag 'io_uring-5.14-2021-07-09' of git://git.kernel.dk/linux-block
# good: [f55966571d5eb2876a11e48e798b4592fa1ffbb7] Merge tag 'drm-next-2021-07-08-1' of git://anongit.freedesktop.org/drm/drm
git bisect start 'HEAD' 'f55966571d5e'
# good: [7a400bf28334fc7734639db3566394e1fc80670c] Merge tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
git bisect good 7a400bf28334fc7734639db3566394e1fc80670c
# bad: [d8dc121eeab9abfbc510097f8db83e87560f753b] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad d8dc121eeab9abfbc510097f8db83e87560f753b
# bad: [7761e36bc7222d1221242c5f195ee0fd40caea40] riscv: Fix PTDUMP output now BPF region moved back to module region
git bisect bad 7761e36bc7222d1221242c5f195ee0fd40caea40
# good: [5def4429aefe65b494816d9ba8ae7f971d522251] riscv: mm: Use better bitmap_zalloc()
git bisect good 5def4429aefe65b494816d9ba8ae7f971d522251
# good: [47513f243b452a5e21180dcf3d6ac1c57e1781a6] riscv: Enable KFENCE for riscv64
git bisect good 47513f243b452a5e21180dcf3d6ac1c57e1781a6
# good: [01112e5e20f5298a81639806cd0a3c587aade467] Merge branch 'riscv-wx-mappings' into for-next
git bisect good 01112e5e20f5298a81639806cd0a3c587aade467
# good: [70eee556b678d1e4cd4ea6742a577b596963fa25] riscv: ptrace: add argn syntax
git bisect good 70eee556b678d1e4cd4ea6742a577b596963fa25
# bad: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
git bisect bad ca6eaaa210deec0e41cbfc380bf89cf079203569
# good: [31da94c25aea835ceac00575a9fd206c5a833fed] riscv: add VMAP_STACK overflow detection
git bisect good 31da94c25aea835ceac00575a9fd206c5a833fed
# first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-07-10  1:49   ` Guenter Roeck
@ 2021-07-13 18:10     ` Geert Uytterhoeven
  2021-07-15  6:20       ` Akira Tsukamoto
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2021-07-13 18:10 UTC (permalink / raw)
  To: Guenter Roeck, Akira Tsukamoto
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	Linux Kernel Mailing List

Hi Günter, Tsukamoto-san,

On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <linux@roeck-us.net> wrote:
> On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
> > This patch will reduce cpu usage dramatically in kernel space especially
> > for application which use sys-call with large buffer size, such as network
> > applications. The main reason behind this is that every unaligned memory
> > access will raise exceptions and switch between s-mode and m-mode causing
> > large overhead.
> >
> > First copy in bytes until reaches the first word aligned boundary in
> > destination memory address. This is the preparation before the bulk
> > aligned word copy.
> >
> > The destination address is aligned now, but oftentimes the source address
> > is not in an aligned boundary. To reduce the unaligned memory access, it
> > reads the data from source in aligned boundaries, which will cause the
> > data to have an offset, and then combines the data in the next iteration
> > by fixing offset with shifting before writing to destination. The majority
> > of the improving copy speed comes from this shift copy.
> >
> > In the lucky situation that the both source and destination address are on
> > the aligned boundary, perform load and store with register size to copy the
> > data. Without the unrolling, it will reduce the speed since the next store
> > instruction for the same register using from the load will stall the
> > pipeline.
> >
> > At last, copying the remainder in one byte at a time.
> >
> > Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
>
> This patch causes all riscv32 qemu emulations to stall during boot.
> The log suggests that something in kernel/user communication may be wrong.
>
> Bad case:
>
> Starting syslogd: OK
> Starting klogd: OK
> /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
> /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
> Starting network: /bin/sh: syntax error: unterminated quoted string

> # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

Same here on vexriscv. Bisected to the same commit.

The actual scripts look fine when using "cat", but contain some garbage
when executing them using "sh -v".

Tsukamoto-san: glancing at the patch:

+       addi    a0, a0, 8*SZREG
+       addi    a1, a1, 8*SZREG

I think you forgot about rv32, where registers cover only 4
bytes each?

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
  2021-07-13 18:10     ` Geert Uytterhoeven
@ 2021-07-15  6:20       ` Akira Tsukamoto
  0 siblings, 0 replies; 7+ messages in thread
From: Akira Tsukamoto @ 2021-07-15  6:20 UTC (permalink / raw)
  To: Geert Uytterhoeven, Guenter Roeck
  Cc: akira.tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Linux Kernel Mailing List



On 7/14/2021 3:10 AM, Geert Uytterhoeven wrote:
> Hi Günter, Tsukamoto-san,
> 
> On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <linux@roeck-us.net> wrote:
>> On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
>>> This patch will reduce cpu usage dramatically in kernel space especially
>>> for application which use sys-call with large buffer size, such as network
>>> applications. The main reason behind this is that every unaligned memory
>>> access will raise exceptions and switch between s-mode and m-mode causing
>>> large overhead.
>>>
>>> First copy in bytes until reaches the first word aligned boundary in
>>> destination memory address. This is the preparation before the bulk
>>> aligned word copy.
>>>
>>> The destination address is aligned now, but oftentimes the source address
>>> is not in an aligned boundary. To reduce the unaligned memory access, it
>>> reads the data from source in aligned boundaries, which will cause the
>>> data to have an offset, and then combines the data in the next iteration
>>> by fixing offset with shifting before writing to destination. The majority
>>> of the improving copy speed comes from this shift copy.
>>>
>>> In the lucky situation that the both source and destination address are on
>>> the aligned boundary, perform load and store with register size to copy the
>>> data. Without the unrolling, it will reduce the speed since the next store
>>> instruction for the same register using from the load will stall the
>>> pipeline.
>>>
>>> At last, copying the remainder in one byte at a time.
>>>
>>> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
>>
>> This patch causes all riscv32 qemu emulations to stall during boot.
>> The log suggests that something in kernel/user communication may be wrong.
>>
>> Bad case:
>>
>> Starting syslogd: OK
>> Starting klogd: OK
>> /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
>> /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
>> Starting network: /bin/sh: syntax error: unterminated quoted string
> 
>> # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
> 
> Same here on vexriscv. Bisected to the same commit.
> 
> The actual scripts look fine when using "cat", but contain some garbage
> when executing them using "sh -v".
> 
> Tsukamoto-san: glancing at the patch:
> 
> +       addi    a0, a0, 8*SZREG
> +       addi    a1, a1, 8*SZREG
> 
> I think you forgot about rv32, where registers cover only 4
> bytes each?

Thanks Günter and Geert for the pointing out the errors.
I will send the fixes, probably this weekend.

Akira

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-07-15  6:20 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-23 12:37 [PATCH v3 0/1] riscv: improving uaccess with logs from network bench Akira Tsukamoto
2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto
2021-07-06 23:16   ` Palmer Dabbelt
2021-07-07 10:07     ` David Laight
2021-07-10  1:49   ` Guenter Roeck
2021-07-13 18:10     ` Geert Uytterhoeven
2021-07-15  6:20       ` Akira Tsukamoto

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).