* [PATCH v3 0/1] riscv: improving uaccess with logs from network bench @ 2021-06-23 12:37 Akira Tsukamoto 2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto 0 siblings, 1 reply; 7+ messages in thread From: Akira Tsukamoto @ 2021-06-23 12:37 UTC (permalink / raw) To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Akira Tsukamoto, linux-riscv, linux-kernel Optimizing copy_to_user and copy_from_user. I rewrote the functions but heavily influenced by Garry's memcpy function [1]. It must be written in assembler to handle page faults manually inside the function unlike other memcpy functions. This patch will reduce cpu usage dramatically in kernel space especially for applications which use sys-call with large buffer size, such as network applications. The main reason behind this is that every unaligned memory access will raise exceptions and switch between s-mode and m-mode causing large overhead. The motivation to create the patch was to improve network performance on beaglev beta board. By observing with perf, the memcpy and __asm_copy_to_user had heavy cpu usage and the network speed was limited at around 680Mbps on 1Gbps lan. Matteo is creating the patches to improve memcpy in C and this patch is meant to be used with them. Typical network applications use system calls with a large buffer on send/recv() and sendto/recvfrom() for the optimization. The bench result, when patching only copy_user. The memcpy is without Matteo's patches but listing the both since they are the top two largest overhead. All results are from the same base kernel, same rootfs and same BeagleV beta board. Results of iperf3 have speedup on UDP with the copy_user patch alone. --- UDP send --- 306 Mbits/sec 362 Mbits/sec 305 Mbits/sec 362 Mbits/sec --- UDP recv --- 772 Mbits/sec 787 Mbits/sec 773 Mbits/sec 784 Mbits/sec Comparison by "perf top -Ue task-clock" while running iperf3. --- TCP recv --- * Before 40.40% [kernel] [k] memcpy 33.09% [kernel] [k] __asm_copy_to_user * With patch 50.35% [kernel] [k] memcpy 13.76% [kernel] [k] __asm_copy_to_user --- TCP send --- * Before 19.96% [kernel] [k] memcpy 9.84% [kernel] [k] __asm_copy_to_user * With patch 14.27% [kernel] [k] memcpy 7.37% [kernel] [k] __asm_copy_to_user --- UDP recv --- * Before 44.45% [kernel] [k] memcpy 31.04% [kernel] [k] __asm_copy_to_user * With patch 55.62% [kernel] [k] memcpy 11.22% [kernel] [k] __asm_copy_to_user --- UDP send --- * Before 25.18% [kernel] [k] memcpy 22.50% [kernel] [k] __asm_copy_to_user * With patch 28.90% [kernel] [k] memcpy 9.49% [kernel] [k] __asm_copy_to_user --- v2 -> v3: - Merged all patches v1 -> v2: - Added shift copy - Separated patches for readability of changes in assembler - Using perf results [1] https://lkml.org/lkml/2021/2/16/778 Akira Tsukamoto (1): riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++-------- 1 file changed, 146 insertions(+), 35 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-06-23 12:37 [PATCH v3 0/1] riscv: improving uaccess with logs from network bench Akira Tsukamoto @ 2021-06-23 12:40 ` Akira Tsukamoto 2021-07-06 23:16 ` Palmer Dabbelt 2021-07-10 1:49 ` Guenter Roeck 0 siblings, 2 replies; 7+ messages in thread From: Akira Tsukamoto @ 2021-06-23 12:40 UTC (permalink / raw) To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel This patch will reduce cpu usage dramatically in kernel space especially for application which use sys-call with large buffer size, such as network applications. The main reason behind this is that every unaligned memory access will raise exceptions and switch between s-mode and m-mode causing large overhead. First copy in bytes until reaches the first word aligned boundary in destination memory address. This is the preparation before the bulk aligned word copy. The destination address is aligned now, but oftentimes the source address is not in an aligned boundary. To reduce the unaligned memory access, it reads the data from source in aligned boundaries, which will cause the data to have an offset, and then combines the data in the next iteration by fixing offset with shifting before writing to destination. The majority of the improving copy speed comes from this shift copy. In the lucky situation that the both source and destination address are on the aligned boundary, perform load and store with register size to copy the data. Without the unrolling, it will reduce the speed since the next store instruction for the same register using from the load will stall the pipeline. At last, copying the remainder in one byte at a time. Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com> --- arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++-------- 1 file changed, 146 insertions(+), 35 deletions(-) diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S index fceaeb18cc64..bceb0629e440 100644 --- a/arch/riscv/lib/uaccess.S +++ b/arch/riscv/lib/uaccess.S @@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user) li t6, SR_SUM csrs CSR_STATUS, t6 - add a3, a1, a2 - /* Use word-oriented copy only if low-order bits match */ - andi t0, a0, SZREG-1 - andi t1, a1, SZREG-1 - bne t0, t1, 2f + /* Save for return value */ + mv t5, a2 - addi t0, a1, SZREG-1 - andi t1, a3, ~(SZREG-1) - andi t0, t0, ~(SZREG-1) /* - * a3: terminal address of source region - * t0: lowest XLEN-aligned address in source - * t1: highest XLEN-aligned address in source + * Register allocation for code below: + * a0 - start of uncopied dst + * a1 - start of uncopied src + * a2 - size + * t0 - end of uncopied dst */ - bgeu t0, t1, 2f - bltu a1, t0, 4f + add t0, a0, a2 + bgtu a0, t0, 5f + + /* + * Use byte copy only if too small. + */ + li a3, 8*SZREG /* size must be larger than size in word_copy */ + bltu a2, a3, .Lbyte_copy_tail + + /* + * Copy first bytes until dst is align to word boundary. + * a0 - start of dst + * t1 - start of aligned dst + */ + addi t1, a0, SZREG-1 + andi t1, t1, ~(SZREG-1) + /* dst is already aligned, skip */ + beq a0, t1, .Lskip_first_bytes 1: - fixup REG_L, t2, (a1), 10f - fixup REG_S, t2, (a0), 10f - addi a1, a1, SZREG - addi a0, a0, SZREG - bltu a1, t1, 1b + /* a5 - one byte for copying data */ + fixup lb a5, 0(a1), 10f + addi a1, a1, 1 /* src */ + fixup sb a5, 0(a0), 10f + addi a0, a0, 1 /* dst */ + bltu a0, t1, 1b /* t1 - start of aligned dst */ + +.Lskip_first_bytes: + /* + * Now dst is aligned. + * Use shift-copy if src is misaligned. + * Use word-copy if both src and dst are aligned because + * can not use shift-copy which do not require shifting + */ + /* a1 - start of src */ + andi a3, a1, SZREG-1 + bnez a3, .Lshift_copy + +.Lword_copy: + /* + * Both src and dst are aligned, unrolled word copy + * + * a0 - start of aligned dst + * a1 - start of aligned src + * a3 - a1 & mask:(SZREG-1) + * t0 - end of aligned dst + */ + addi t0, t0, -(8*SZREG-1) /* not to over run */ 2: - bltu a1, a3, 5f + fixup REG_L a4, 0(a1), 10f + fixup REG_L a5, SZREG(a1), 10f + fixup REG_L a6, 2*SZREG(a1), 10f + fixup REG_L a7, 3*SZREG(a1), 10f + fixup REG_L t1, 4*SZREG(a1), 10f + fixup REG_L t2, 5*SZREG(a1), 10f + fixup REG_L t3, 6*SZREG(a1), 10f + fixup REG_L t4, 7*SZREG(a1), 10f + fixup REG_S a4, 0(a0), 10f + fixup REG_S a5, SZREG(a0), 10f + fixup REG_S a6, 2*SZREG(a0), 10f + fixup REG_S a7, 3*SZREG(a0), 10f + fixup REG_S t1, 4*SZREG(a0), 10f + fixup REG_S t2, 5*SZREG(a0), 10f + fixup REG_S t3, 6*SZREG(a0), 10f + fixup REG_S t4, 7*SZREG(a0), 10f + addi a0, a0, 8*SZREG + addi a1, a1, 8*SZREG + bltu a0, t0, 2b + + addi t0, t0, 8*SZREG-1 /* revert to original value */ + j .Lbyte_copy_tail + +.Lshift_copy: + + /* + * Word copy with shifting. + * For misaligned copy we still perform aligned word copy, but + * we need to use the value fetched from the previous iteration and + * do some shifts. + * This is safe because reading less than a word size. + * + * a0 - start of aligned dst + * a1 - start of src + * a3 - a1 & mask:(SZREG-1) + * t0 - end of uncopied dst + * t1 - end of aligned dst + */ + /* calculating aligned word boundary for dst */ + andi t1, t0, ~(SZREG-1) + /* Converting unaligned src to aligned arc */ + andi a1, a1, ~(SZREG-1) + + /* + * Calculate shifts + * t3 - prev shift + * t4 - current shift + */ + slli t3, a3, LGREG + li a5, SZREG*8 + sub t4, a5, t3 + + /* Load the first word to combine with seceond word */ + fixup REG_L a5, 0(a1), 10f 3: + /* Main shifting copy + * + * a0 - start of aligned dst + * a1 - start of aligned src + * t1 - end of aligned dst + */ + + /* At least one iteration will be executed */ + srl a4, a5, t3 + fixup REG_L a5, SZREG(a1), 10f + addi a1, a1, SZREG + sll a2, a5, t4 + or a2, a2, a4 + fixup REG_S a2, 0(a0), 10f + addi a0, a0, SZREG + bltu a0, t1, 3b + + /* Revert src to original unaligned value */ + add a1, a1, a3 + +.Lbyte_copy_tail: + /* + * Byte copy anything left. + * + * a0 - start of remaining dst + * a1 - start of remaining src + * t0 - end of remaining dst + */ + bgeu a0, t0, 5f +4: + fixup lb a5, 0(a1), 10f + addi a1, a1, 1 /* src */ + fixup sb a5, 0(a0), 10f + addi a0, a0, 1 /* dst */ + bltu a0, t0, 4b /* t0 - end of dst */ + +5: /* Disable access to user memory */ csrc CSR_STATUS, t6 - li a0, 0 + li a0, 0 ret -4: /* Edge case: unalignment */ - fixup lbu, t2, (a1), 10f - fixup sb, t2, (a0), 10f - addi a1, a1, 1 - addi a0, a0, 1 - bltu a1, t0, 4b - j 1b -5: /* Edge case: remainder */ - fixup lbu, t2, (a1), 10f - fixup sb, t2, (a0), 10f - addi a1, a1, 1 - addi a0, a0, 1 - bltu a1, a3, 5b - j 3b ENDPROC(__asm_copy_to_user) ENDPROC(__asm_copy_from_user) EXPORT_SYMBOL(__asm_copy_to_user) @@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user) 10: /* Disable access to user memory */ csrs CSR_STATUS, t6 - mv a0, a2 + mv a0, t5 ret 11: csrs CSR_STATUS, t6 -- 2.17.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto @ 2021-07-06 23:16 ` Palmer Dabbelt 2021-07-07 10:07 ` David Laight 2021-07-10 1:49 ` Guenter Roeck 1 sibling, 1 reply; 7+ messages in thread From: Palmer Dabbelt @ 2021-07-06 23:16 UTC (permalink / raw) To: akira.tsukamoto; +Cc: Paul Walmsley, aou, linux-riscv, linux-kernel On Wed, 23 Jun 2021 05:40:39 PDT (-0700), akira.tsukamoto@gmail.com wrote: > > This patch will reduce cpu usage dramatically in kernel space especially > for application which use sys-call with large buffer size, such as network > applications. The main reason behind this is that every unaligned memory > access will raise exceptions and switch between s-mode and m-mode causing > large overhead. > > First copy in bytes until reaches the first word aligned boundary in > destination memory address. This is the preparation before the bulk > aligned word copy. > > The destination address is aligned now, but oftentimes the source address > is not in an aligned boundary. To reduce the unaligned memory access, it > reads the data from source in aligned boundaries, which will cause the > data to have an offset, and then combines the data in the next iteration > by fixing offset with shifting before writing to destination. The majority > of the improving copy speed comes from this shift copy. > > In the lucky situation that the both source and destination address are on > the aligned boundary, perform load and store with register size to copy the > data. Without the unrolling, it will reduce the speed since the next store > instruction for the same register using from the load will stall the > pipeline. > > At last, copying the remainder in one byte at a time. > > Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com> > --- > arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++-------- > 1 file changed, 146 insertions(+), 35 deletions(-) > > diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S > index fceaeb18cc64..bceb0629e440 100644 > --- a/arch/riscv/lib/uaccess.S > +++ b/arch/riscv/lib/uaccess.S > @@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user) > li t6, SR_SUM > csrs CSR_STATUS, t6 > > - add a3, a1, a2 > - /* Use word-oriented copy only if low-order bits match */ > - andi t0, a0, SZREG-1 > - andi t1, a1, SZREG-1 > - bne t0, t1, 2f > + /* Save for return value */ > + mv t5, a2 > > - addi t0, a1, SZREG-1 > - andi t1, a3, ~(SZREG-1) > - andi t0, t0, ~(SZREG-1) > /* > - * a3: terminal address of source region > - * t0: lowest XLEN-aligned address in source > - * t1: highest XLEN-aligned address in source > + * Register allocation for code below: > + * a0 - start of uncopied dst > + * a1 - start of uncopied src > + * a2 - size > + * t0 - end of uncopied dst > */ > - bgeu t0, t1, 2f > - bltu a1, t0, 4f > + add t0, a0, a2 > + bgtu a0, t0, 5f > + > + /* > + * Use byte copy only if too small. > + */ > + li a3, 8*SZREG /* size must be larger than size in word_copy */ > + bltu a2, a3, .Lbyte_copy_tail > + > + /* > + * Copy first bytes until dst is align to word boundary. > + * a0 - start of dst > + * t1 - start of aligned dst > + */ > + addi t1, a0, SZREG-1 > + andi t1, t1, ~(SZREG-1) > + /* dst is already aligned, skip */ > + beq a0, t1, .Lskip_first_bytes > 1: > - fixup REG_L, t2, (a1), 10f > - fixup REG_S, t2, (a0), 10f > - addi a1, a1, SZREG > - addi a0, a0, SZREG > - bltu a1, t1, 1b > + /* a5 - one byte for copying data */ > + fixup lb a5, 0(a1), 10f > + addi a1, a1, 1 /* src */ > + fixup sb a5, 0(a0), 10f > + addi a0, a0, 1 /* dst */ > + bltu a0, t1, 1b /* t1 - start of aligned dst */ > + > +.Lskip_first_bytes: > + /* > + * Now dst is aligned. > + * Use shift-copy if src is misaligned. > + * Use word-copy if both src and dst are aligned because > + * can not use shift-copy which do not require shifting > + */ > + /* a1 - start of src */ > + andi a3, a1, SZREG-1 > + bnez a3, .Lshift_copy > + > +.Lword_copy: > + /* > + * Both src and dst are aligned, unrolled word copy > + * > + * a0 - start of aligned dst > + * a1 - start of aligned src > + * a3 - a1 & mask:(SZREG-1) > + * t0 - end of aligned dst > + */ > + addi t0, t0, -(8*SZREG-1) /* not to over run */ > 2: > - bltu a1, a3, 5f > + fixup REG_L a4, 0(a1), 10f > + fixup REG_L a5, SZREG(a1), 10f > + fixup REG_L a6, 2*SZREG(a1), 10f > + fixup REG_L a7, 3*SZREG(a1), 10f > + fixup REG_L t1, 4*SZREG(a1), 10f > + fixup REG_L t2, 5*SZREG(a1), 10f > + fixup REG_L t3, 6*SZREG(a1), 10f > + fixup REG_L t4, 7*SZREG(a1), 10f > + fixup REG_S a4, 0(a0), 10f > + fixup REG_S a5, SZREG(a0), 10f > + fixup REG_S a6, 2*SZREG(a0), 10f > + fixup REG_S a7, 3*SZREG(a0), 10f > + fixup REG_S t1, 4*SZREG(a0), 10f > + fixup REG_S t2, 5*SZREG(a0), 10f > + fixup REG_S t3, 6*SZREG(a0), 10f > + fixup REG_S t4, 7*SZREG(a0), 10f This seems like a suspiciously large unrolling factor, at least without a fallback. My guess is that some workloads will want some smaller unrolling factors, but given that we run on these single-issue in-order processors it's probably best to have some big unrolling factors as well since they're pretty limited WRT integer bandwidth. > + addi a0, a0, 8*SZREG > + addi a1, a1, 8*SZREG > + bltu a0, t0, 2b > + > + addi t0, t0, 8*SZREG-1 /* revert to original value */ > + j .Lbyte_copy_tail > + > +.Lshift_copy: > + > + /* > + * Word copy with shifting. > + * For misaligned copy we still perform aligned word copy, but > + * we need to use the value fetched from the previous iteration and > + * do some shifts. > + * This is safe because reading less than a word size. > + * > + * a0 - start of aligned dst > + * a1 - start of src > + * a3 - a1 & mask:(SZREG-1) > + * t0 - end of uncopied dst > + * t1 - end of aligned dst > + */ > + /* calculating aligned word boundary for dst */ > + andi t1, t0, ~(SZREG-1) > + /* Converting unaligned src to aligned arc */ > + andi a1, a1, ~(SZREG-1) > + > + /* > + * Calculate shifts > + * t3 - prev shift > + * t4 - current shift > + */ > + slli t3, a3, LGREG > + li a5, SZREG*8 > + sub t4, a5, t3 > + > + /* Load the first word to combine with seceond word */ > + fixup REG_L a5, 0(a1), 10f > > 3: > + /* Main shifting copy > + * > + * a0 - start of aligned dst > + * a1 - start of aligned src > + * t1 - end of aligned dst > + */ > + > + /* At least one iteration will be executed */ > + srl a4, a5, t3 > + fixup REG_L a5, SZREG(a1), 10f > + addi a1, a1, SZREG > + sll a2, a5, t4 > + or a2, a2, a4 > + fixup REG_S a2, 0(a0), 10f > + addi a0, a0, SZREG > + bltu a0, t1, 3b > + > + /* Revert src to original unaligned value */ > + add a1, a1, a3 > + > +.Lbyte_copy_tail: > + /* > + * Byte copy anything left. > + * > + * a0 - start of remaining dst > + * a1 - start of remaining src > + * t0 - end of remaining dst > + */ > + bgeu a0, t0, 5f > +4: > + fixup lb a5, 0(a1), 10f > + addi a1, a1, 1 /* src */ > + fixup sb a5, 0(a0), 10f > + addi a0, a0, 1 /* dst */ > + bltu a0, t0, 4b /* t0 - end of dst */ > + > +5: > /* Disable access to user memory */ > csrc CSR_STATUS, t6 > - li a0, 0 > + li a0, 0 > ret > -4: /* Edge case: unalignment */ > - fixup lbu, t2, (a1), 10f > - fixup sb, t2, (a0), 10f > - addi a1, a1, 1 > - addi a0, a0, 1 > - bltu a1, t0, 4b > - j 1b > -5: /* Edge case: remainder */ > - fixup lbu, t2, (a1), 10f > - fixup sb, t2, (a0), 10f > - addi a1, a1, 1 > - addi a0, a0, 1 > - bltu a1, a3, 5b > - j 3b > ENDPROC(__asm_copy_to_user) > ENDPROC(__asm_copy_from_user) > EXPORT_SYMBOL(__asm_copy_to_user) > @@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user) > 10: > /* Disable access to user memory */ > csrs CSR_STATUS, t6 > - mv a0, a2 > + mv a0, t5 > ret > 11: > csrs CSR_STATUS, t6 That said, this is good enough for me. If someone comes up with a case where the extra unrolling is an issue I'm happy to take something to fix it, but until then I'm fine with this as-is. Like the string fuctions it's probably best to eventually put this in C, but IIRC last time I tried it was kind of a headache. This is on for-next. Thanks! ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-07-06 23:16 ` Palmer Dabbelt @ 2021-07-07 10:07 ` David Laight 0 siblings, 0 replies; 7+ messages in thread From: David Laight @ 2021-07-07 10:07 UTC (permalink / raw) To: 'Palmer Dabbelt', akira.tsukamoto Cc: Paul Walmsley, aou, linux-riscv, linux-kernel ... > > + fixup REG_L a4, 0(a1), 10f > > + fixup REG_L a5, SZREG(a1), 10f > > + fixup REG_L a6, 2*SZREG(a1), 10f > > + fixup REG_L a7, 3*SZREG(a1), 10f > > + fixup REG_L t1, 4*SZREG(a1), 10f > > + fixup REG_L t2, 5*SZREG(a1), 10f > > + fixup REG_L t3, 6*SZREG(a1), 10f > > + fixup REG_L t4, 7*SZREG(a1), 10f > > + fixup REG_S a4, 0(a0), 10f > > + fixup REG_S a5, SZREG(a0), 10f > > + fixup REG_S a6, 2*SZREG(a0), 10f > > + fixup REG_S a7, 3*SZREG(a0), 10f > > + fixup REG_S t1, 4*SZREG(a0), 10f > > + fixup REG_S t2, 5*SZREG(a0), 10f > > + fixup REG_S t3, 6*SZREG(a0), 10f > > + fixup REG_S t4, 7*SZREG(a0), 10f > > This seems like a suspiciously large unrolling factor, at least without > a fallback. My guess is that some workloads will want some smaller > unrolling factors, but given that we run on these single-issue in-order > processors it's probably best to have some big unrolling factors as well > since they're pretty limited WRT integer bandwidth. But a single-issue cpu is unlikely to have an 8 clock data delay. OTOH a cpu than can do concurrent memory read and write might not have enough 'out of order' capability to do so with the above loop. You may want to interleave the reads and writes - starting with two or three reads (possibly with the extra ones outside the loop). I don't know the microarchitectures well enough (well at all) to know the exact pitfalls. The very simple cpu might have the same 'issue' the Nios2 has (another MIPS clone fpga soft cpu) where there can be a pipeline stall between a write and read. I doubt the non-trivial riscv have that issue though. > > + addi a0, a0, 8*SZREG > > + addi a1, a1, 8*SZREG > > + bltu a0, t0, 2b For a dual-issue cpu you want to move the two 'addi' higher up the loop so that they are 'free'. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto 2021-07-06 23:16 ` Palmer Dabbelt @ 2021-07-10 1:49 ` Guenter Roeck 2021-07-13 18:10 ` Geert Uytterhoeven 1 sibling, 1 reply; 7+ messages in thread From: Guenter Roeck @ 2021-07-10 1:49 UTC (permalink / raw) To: Akira Tsukamoto Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel Hi, On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote: > This patch will reduce cpu usage dramatically in kernel space especially > for application which use sys-call with large buffer size, such as network > applications. The main reason behind this is that every unaligned memory > access will raise exceptions and switch between s-mode and m-mode causing > large overhead. > > First copy in bytes until reaches the first word aligned boundary in > destination memory address. This is the preparation before the bulk > aligned word copy. > > The destination address is aligned now, but oftentimes the source address > is not in an aligned boundary. To reduce the unaligned memory access, it > reads the data from source in aligned boundaries, which will cause the > data to have an offset, and then combines the data in the next iteration > by fixing offset with shifting before writing to destination. The majority > of the improving copy speed comes from this shift copy. > > In the lucky situation that the both source and destination address are on > the aligned boundary, perform load and store with register size to copy the > data. Without the unrolling, it will reduce the speed since the next store > instruction for the same register using from the load will stall the > pipeline. > > At last, copying the remainder in one byte at a time. > > Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com> This patch causes all riscv32 qemu emulations to stall during boot. The log suggests that something in kernel/user communication may be wrong. Bad case: Starting syslogd: OK Starting klogd: OK /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string Starting network: /bin/sh: syntax error: unterminated quoted string sed: unmatched '/' /bin/sh: syntax error: unterminated quoted string FAIL /etc/init.d/S55runtest: line 48: syntax error: EOF in backquote substitution Good case (this patch reverted): Starting syslogd: OK Starting klogd: OK Running sysctl: OK Saving random seed: [ 12.277714] random: dd: uninitialized urandom read (512 bytes read) OK Starting network: [ 12.949529] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX [ 12.951170] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready udhcpc: started, v1.33.0 udhcpc: sending discover udhcpc: sending select for 10.0.2.15 udhcpc: lease of 10.0.2.15 obtained, lease time 86400 deleting routers adding dns 10.0.2.3 OK Found console ttyS0 Reverting this patch fixes the problem. Bisect log attached. Guenter --- # bad: [50be9417e23af5a8ac860d998e1e3f06b8fd79d7] Merge tag 'io_uring-5.14-2021-07-09' of git://git.kernel.dk/linux-block # good: [f55966571d5eb2876a11e48e798b4592fa1ffbb7] Merge tag 'drm-next-2021-07-08-1' of git://anongit.freedesktop.org/drm/drm git bisect start 'HEAD' 'f55966571d5e' # good: [7a400bf28334fc7734639db3566394e1fc80670c] Merge tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs git bisect good 7a400bf28334fc7734639db3566394e1fc80670c # bad: [d8dc121eeab9abfbc510097f8db83e87560f753b] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 git bisect bad d8dc121eeab9abfbc510097f8db83e87560f753b # bad: [7761e36bc7222d1221242c5f195ee0fd40caea40] riscv: Fix PTDUMP output now BPF region moved back to module region git bisect bad 7761e36bc7222d1221242c5f195ee0fd40caea40 # good: [5def4429aefe65b494816d9ba8ae7f971d522251] riscv: mm: Use better bitmap_zalloc() git bisect good 5def4429aefe65b494816d9ba8ae7f971d522251 # good: [47513f243b452a5e21180dcf3d6ac1c57e1781a6] riscv: Enable KFENCE for riscv64 git bisect good 47513f243b452a5e21180dcf3d6ac1c57e1781a6 # good: [01112e5e20f5298a81639806cd0a3c587aade467] Merge branch 'riscv-wx-mappings' into for-next git bisect good 01112e5e20f5298a81639806cd0a3c587aade467 # good: [70eee556b678d1e4cd4ea6742a577b596963fa25] riscv: ptrace: add argn syntax git bisect good 70eee556b678d1e4cd4ea6742a577b596963fa25 # bad: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall git bisect bad ca6eaaa210deec0e41cbfc380bf89cf079203569 # good: [31da94c25aea835ceac00575a9fd206c5a833fed] riscv: add VMAP_STACK overflow detection git bisect good 31da94c25aea835ceac00575a9fd206c5a833fed # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-07-10 1:49 ` Guenter Roeck @ 2021-07-13 18:10 ` Geert Uytterhoeven 2021-07-15 6:20 ` Akira Tsukamoto 0 siblings, 1 reply; 7+ messages in thread From: Geert Uytterhoeven @ 2021-07-13 18:10 UTC (permalink / raw) To: Guenter Roeck, Akira Tsukamoto Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, Linux Kernel Mailing List Hi Günter, Tsukamoto-san, On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <linux@roeck-us.net> wrote: > On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote: > > This patch will reduce cpu usage dramatically in kernel space especially > > for application which use sys-call with large buffer size, such as network > > applications. The main reason behind this is that every unaligned memory > > access will raise exceptions and switch between s-mode and m-mode causing > > large overhead. > > > > First copy in bytes until reaches the first word aligned boundary in > > destination memory address. This is the preparation before the bulk > > aligned word copy. > > > > The destination address is aligned now, but oftentimes the source address > > is not in an aligned boundary. To reduce the unaligned memory access, it > > reads the data from source in aligned boundaries, which will cause the > > data to have an offset, and then combines the data in the next iteration > > by fixing offset with shifting before writing to destination. The majority > > of the improving copy speed comes from this shift copy. > > > > In the lucky situation that the both source and destination address are on > > the aligned boundary, perform load and store with register size to copy the > > data. Without the unrolling, it will reduce the speed since the next store > > instruction for the same register using from the load will stall the > > pipeline. > > > > At last, copying the remainder in one byte at a time. > > > > Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com> > > This patch causes all riscv32 qemu emulations to stall during boot. > The log suggests that something in kernel/user communication may be wrong. > > Bad case: > > Starting syslogd: OK > Starting klogd: OK > /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution > /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string > Starting network: /bin/sh: syntax error: unterminated quoted string > # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Same here on vexriscv. Bisected to the same commit. The actual scripts look fine when using "cat", but contain some garbage when executing them using "sh -v". Tsukamoto-san: glancing at the patch: + addi a0, a0, 8*SZREG + addi a1, a1, 8*SZREG I think you forgot about rv32, where registers cover only 4 bytes each? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall 2021-07-13 18:10 ` Geert Uytterhoeven @ 2021-07-15 6:20 ` Akira Tsukamoto 0 siblings, 0 replies; 7+ messages in thread From: Akira Tsukamoto @ 2021-07-15 6:20 UTC (permalink / raw) To: Geert Uytterhoeven, Guenter Roeck Cc: akira.tsukamoto, Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, Linux Kernel Mailing List On 7/14/2021 3:10 AM, Geert Uytterhoeven wrote: > Hi Günter, Tsukamoto-san, > > On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <linux@roeck-us.net> wrote: >> On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote: >>> This patch will reduce cpu usage dramatically in kernel space especially >>> for application which use sys-call with large buffer size, such as network >>> applications. The main reason behind this is that every unaligned memory >>> access will raise exceptions and switch between s-mode and m-mode causing >>> large overhead. >>> >>> First copy in bytes until reaches the first word aligned boundary in >>> destination memory address. This is the preparation before the bulk >>> aligned word copy. >>> >>> The destination address is aligned now, but oftentimes the source address >>> is not in an aligned boundary. To reduce the unaligned memory access, it >>> reads the data from source in aligned boundaries, which will cause the >>> data to have an offset, and then combines the data in the next iteration >>> by fixing offset with shifting before writing to destination. The majority >>> of the improving copy speed comes from this shift copy. >>> >>> In the lucky situation that the both source and destination address are on >>> the aligned boundary, perform load and store with register size to copy the >>> data. Without the unrolling, it will reduce the speed since the next store >>> instruction for the same register using from the load will stall the >>> pipeline. >>> >>> At last, copying the remainder in one byte at a time. >>> >>> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com> >> >> This patch causes all riscv32 qemu emulations to stall during boot. >> The log suggests that something in kernel/user communication may be wrong. >> >> Bad case: >> >> Starting syslogd: OK >> Starting klogd: OK >> /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution >> /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string >> Starting network: /bin/sh: syntax error: unterminated quoted string > >> # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall > > Same here on vexriscv. Bisected to the same commit. > > The actual scripts look fine when using "cat", but contain some garbage > when executing them using "sh -v". > > Tsukamoto-san: glancing at the patch: > > + addi a0, a0, 8*SZREG > + addi a1, a1, 8*SZREG > > I think you forgot about rv32, where registers cover only 4 > bytes each? Thanks Günter and Geert for the pointing out the errors. I will send the fixes, probably this weekend. Akira ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-07-15 6:20 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-23 12:37 [PATCH v3 0/1] riscv: improving uaccess with logs from network bench Akira Tsukamoto 2021-06-23 12:40 ` [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall Akira Tsukamoto 2021-07-06 23:16 ` Palmer Dabbelt 2021-07-07 10:07 ` David Laight 2021-07-10 1:49 ` Guenter Roeck 2021-07-13 18:10 ` Geert Uytterhoeven 2021-07-15 6:20 ` Akira Tsukamoto
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).