* [PATCH 0/3] riscv: optimized mem* functions
@ 2021-06-15  2:38 ` Matteo Croce
  0 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini,
	Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Replace the assembly mem{cpy,move,set} with C equivalents.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.

Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  18 ++--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  17 ----
 arch/riscv/lib/Makefile         |   4 +-
 arch/riscv/lib/memcpy.S         | 108 ---------------------
 arch/riscv/lib/memmove.S        |  64 -------------
 arch/riscv/lib/memset.S         | 113 ----------------------
 arch/riscv/lib/string.c         | 162 ++++++++++++++++++++++++++++++++
 8 files changed, 172 insertions(+), 315 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.31.1

^ permalink raw reply	[flat|nested] 64+ messages in thread
* [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  2:38 ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini,
	Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is
aligned to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose a long at every cycle.
Finally, copy the remainder one byte at a time.

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71         sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec            receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144         sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec            receiver

And the decreased CPU time of the memcpy() is observable with perf top.
This is the `perf top -Ue task-clock` output when doing the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |   8 ++-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   2 +-
 arch/riscv/lib/memcpy.S         | 108 --------------------------
 arch/riscv/lib/string.c         |  94 +++++++++++++++++++++++
 5 files changed, 101 insertions(+), 113 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o

 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,        0(a1)
-	REG_L a5,    SZREG(a1)
-	REG_L a6,  2*SZREG(a1)
-	REG_L a7,  3*SZREG(a1)
-	REG_L t0,  4*SZREG(a1)
-	REG_L t1,  5*SZREG(a1)
-	REG_L t2,  6*SZREG(a1)
-	REG_L t3,  7*SZREG(a1)
-	REG_L t4,  8*SZREG(a1)
-	REG_L t5,  9*SZREG(a1)
-	REG_S a4,        0(t6)
-	REG_S a5,    SZREG(t6)
-	REG_S a6,  2*SZREG(t6)
-	REG_S a7,  3*SZREG(t6)
-	REG_S t0,  4*SZREG(t6)
-	REG_S t1,  5*SZREG(t6)
-	REG_S t2,  6*SZREG(t6)
-	REG_S t3,  7*SZREG(t6)
-	REG_S t4,  8*SZREG(t6)
-	REG_S t5,  9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..525f9ee25a74
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* size below a classic byte at time copy is done */
+#define MIN_THRESHOLD 64
+
+/* convenience types to avoid cast between different pointer types */
+union types {
+	u8 *u8;
+	unsigned long *ulong;
+	uintptr_t uptr;
+};
+
+union const_types {
+	const u8 *u8;
+	unsigned long *ulong;
+};
+
+void *memcpy(void *dest, const void *src, size_t count)
+{
+	const int bytes_long = BITS_PER_LONG / 8;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	const int mask = bytes_long - 1;
+	const int distance = (src - dest) & mask;
+#endif
+	union const_types s = { .u8 = src };
+	union types d = { .u8 = dest };
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (count <= MIN_THRESHOLD)
+		goto copy_remainder;
+
+	/* copy a byte at time until destination is aligned */
+	for (; count && d.uptr & mask; count--)
+		*d.u8++ = *s.u8++;
+
+	if (distance) {
+		unsigned long last, next;
+
+		/* move s backward to the previous alignment boundary */
+		s.u8 -= distance;
+
+		/* 32/64 bit wide copy from s to d.
+		 * d is aligned now but s is not, so read s alignment wise,
+		 * and do proper shift to get the right value.
+		 * Works only on Little Endian machines.
+		 */
+		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+			last = next;
+			next = s.ulong[1];
+
+			d.ulong[0] = last >> (distance * 8) |
+				     next << ((bytes_long - distance) * 8);
+
+			d.ulong++;
+			s.ulong++;
+		}
+
+		/* restore s with the original offset */
+		s.u8 += distance;
+	} else
+#endif
+	{
+		/* if the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= bytes_long; count -= bytes_long)
+			*d.ulong++ = *s.ulong++;
+	}
+
+	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
+	goto copy_remainder;
+
+copy_remainder:
+	while (count--)
+		*d.u8++ = *s.u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	return memcpy(dest, src, count);
+}
+EXPORT_SYMBOL(__memcpy);
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 64+ messages in thread
* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  8:57 ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-15  8:57 UTC (permalink / raw)
  To: 'Matteo Croce', linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini,
	Bin Meng

From: Matteo Croce
> Sent: 15 June 2021 03:38
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.

I'm surprised that the C loop:

> +	for (; count >= bytes_long; count -= bytes_long)
> +		*d.ulong++ = *s.ulong++;

ends up being faster than the ASM 'read lots' - 'write lots' loop.

Especially since there was an earlier patch to convert
copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
instead of a tight single register copy loop.

I'd also guess that the performance needs to be measured on
different classes of riscv cpu.

A simple cpu will behave differently to one that can execute
multiple instructions per clock.
Any form of 'out of order' execution also changes things.
The other big change is whether the cpu can do a memory
read and write in the same clock.

I'd guess that riscv exist with some/all of those features.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  8:57 ` David Laight
@ 2021-06-15 13:08 ` Bin Meng
  -1 siblings, 0 replies; 64+ messages in thread
From: Bin Meng @ 2021-06-15 13:08 UTC (permalink / raw)
  To: David Laight
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 15 June 2021 03:38
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
>
> I'm surprised that the C loop:
>
> > +	for (; count >= bytes_long; count -= bytes_long)
> > +		*d.ulong++ = *s.ulong++;
>
> ends up being faster than the ASM 'read lots' - 'write lots' loop.

I believe that's because the assembly version has some unaligned
access cases, which end up being trap-n-emulated in the OpenSBI
firmware, and that is a big overhead.

>
> Especially since there was an earlier patch to convert
> copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
> instead of a tight single register copy loop.
>
> I'd also guess that the performance needs to be measured on
> different classes of riscv cpu.
>
> A simple cpu will behave differently to one that can execute
> multiple instructions per clock.
> Any form of 'out of order' execution also changes things.
> The other big change is whether the cpu can to a memory
> read and write in the same clock.
>
> I'd guess that riscv exist with some/all of those features.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread
* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:08 ` Bin Meng
@ 2021-06-15 13:18 ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-15 13:18 UTC (permalink / raw)
  To: 'Bin Meng'
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

From: Bin Meng
> Sent: 15 June 2021 14:09
>
> On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
>
...
> > I'm surprised that the C loop:
> >
> > > +	for (; count >= bytes_long; count -= bytes_long)
> > > +		*d.ulong++ = *s.ulong++;
> >
> > ends up being faster than the ASM 'read lots' - 'write lots' loop.
>
> I believe that's because the assembly version has some unaligned
> access cases, which end up being trap-n-emulated in the OpenSBI
> firmware, and that is a big overhead.

Ah, that would make sense since the asm user copy code
was broken for misaligned copies.
I suspect memcpy() was broken the same way.

I'm surprised NET_IP_ALIGN isn't set to 2 to try to
avoid all these misaligned copies in the network stack.
Although avoiding 8n+4 aligned data is rather harder.

Misaligned copies are just best avoided - really even on x86.
The 'real fun' is when the access crosses TLB boundaries.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:18 ` David Laight
@ 2021-06-15 13:28 ` Bin Meng
  -1 siblings, 0 replies; 64+ messages in thread
From: Bin Meng @ 2021-06-15 13:28 UTC (permalink / raw)
  To: David Laight, Gary Guo
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
      Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
      Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 9:18 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> > ...
> > > I'm surprised that the C loop:
> > >
> > > > +	for (; count >= bytes_long; count -= bytes_long)
> > > > +		*d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.

Yes, Gary Guo sent a patch a long time ago against the broken assembly
version, but that patch has still not been applied as of today.
https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/

I suggest Matteo re-test using Gary's version.

> I'm surprised NET_IP_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>
> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:28 ` Bin Meng
@ 2021-06-15 16:12 ` Emil Renner Berthing
  -1 siblings, 0 replies; 64+ messages in thread
From: Emil Renner Berthing @ 2021-06-15 16:12 UTC (permalink / raw)
  To: Bin Meng
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
      linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
      Akira Tsukamoto, Drew Fustini

On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> ...
> Yes, Gary Guo sent a patch a long time ago against the broken assembly
> version, but that patch has still not been applied as of today.
> https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
>
> I suggest Matteo re-test using Gary's version.

That's a good idea, but if you read the replies to Gary's original patch
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
... both Gary, Palmer and David would rather like a C-based version.
This is one attempt at providing that.

> > I'm surprised NET_IP_ALIGN isn't set to 2 to try to
> > avoid all these misaligned copies in the network stack.
> > Although avoiding 8n+4 aligned data is rather harder.
> >
> > Misaligned copies are just best avoided - really even on x86.
> > The 'real fun' is when the access crosses TLB boundaries.
>
> Regards,
> Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 16:12 ` Emil Renner Berthing
@ 2021-06-16  0:33 ` Bin Meng
  -1 siblings, 0 replies; 64+ messages in thread
From: Bin Meng @ 2021-06-16 0:33 UTC (permalink / raw)
  To: Emil Renner Berthing
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
      linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
      Akira Tsukamoto, Drew Fustini

On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing <kernel@esmil.dk> wrote:
>
> On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > ...
> > Yes, Gary Guo sent a patch a long time ago against the broken assembly
> > version, but that patch has still not been applied as of today.
> > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> >
> > I suggest Matteo re-test using Gary's version.
>
> That's a good idea, but if you read the replies to Gary's original patch
> https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> ... both Gary, Palmer and David would rather like a C-based version.
> This is one attempt at providing that.

Yep, I prefer C as well :)

But if you check commit 04091d6, the assembly version was introduced
for KASAN. So if we are to change it back to C, please make sure KASAN
is not broken.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread
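[For context on the KASAN point: the kernel's C string routines are typically provided as uninstrumented `__mem*` workers, with the public `mem*` names as aliases; KASAN builds can then repoint the public names at wrappers that check the accessed range first. A rough sketch of that aliasing mechanism, with hypothetical names — the kernel wraps this in its own macros, and the details here are illustrative only:]

```c
#include <stddef.h>

/* Uninstrumented worker: byte-wise fill, kept trivial for illustration. */
void *__sketch_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)
		*p++ = (unsigned char)c;
	return s;
}

/* The public name is just an alias for the worker.  In a KASAN build
 * the kernel instead points the public name at a wrapper that checks
 * the [s, s + n) range before calling the worker. */
void *sketch_memset(void *s, int c, size_t n)
	__attribute__((alias("__sketch_memset")));
```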
* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  0:33 ` Bin Meng
@ 2021-06-16  2:01 ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-16 2:01 UTC (permalink / raw)
  To: Bin Meng
  Cc: Emil Renner Berthing, David Laight, Gary Guo, linux-riscv,
      linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
      Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, 16 Jun 2021 08:33:21 +0800
Bin Meng <bmeng.cn@gmail.com> wrote:

> On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing
> <kernel@esmil.dk> wrote:
> >
> > On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > > ...
> > > Yes, Gary Guo sent a patch a long time ago against the broken
> > > assembly version, but that patch has still not been applied as of
> > > today.
> > > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> > >
> > > I suggest Matteo re-test using Gary's version.
> >
> > That's a good idea, but if you read the replies to Gary's original
> > patch
> > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > ... both Gary, Palmer and David would rather like a C-based version.
> > This is one attempt at providing that.
>
> Yep, I prefer C as well :)
>
> But if you check commit 04091d6, the assembly version was introduced
> for KASAN. So if we are to change it back to C, please make sure KASAN
> is not broken.
>
> Regards,
> Bin

I added a small benchmark for memcpy() and memset() in lib/test_string.c:

memcpy_align_selftest():

	#define PG_SIZE (1 << (MAX_ORDER - 1 + PAGE_SHIFT))

	page1 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
	page2 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		for (j = 0; j < sizeof(void*); j++) {
			t0 = ktime_get();
			memcpy(dst + j, src + i, PG_SIZE - max(i, j));
			t1 = ktime_get();
			printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n",
			       i, j, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
		}
	}

memset_align_selftest():

	page = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		t0 = ktime_get();
		memset(dst + i, 0, PG_SIZE - i);
		t1 = ktime_get();
		printk("Strings selftest: memset(dst+%d): %llu Mb/s\n",
		       i, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
	}

And I ran it against the three implementations: the current one, Gary's
assembly and mine in C.

Current:

[   38.980687] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   39.021612] Strings selftest: memcpy(src+0, dst+1): 113 Mb/s
[   39.062191] Strings selftest: memcpy(src+0, dst+2): 114 Mb/s
[   39.102669] Strings selftest: memcpy(src+0, dst+3): 114 Mb/s
[   39.127423] Strings selftest: memcpy(src+0, dst+4): 209 Mb/s
[   39.167836] Strings selftest: memcpy(src+0, dst+5): 115 Mb/s
[   39.208305] Strings selftest: memcpy(src+0, dst+6): 114 Mb/s
[   39.248712] Strings selftest: memcpy(src+0, dst+7): 115 Mb/s
[   39.288144] Strings selftest: memcpy(src+1, dst+0): 118 Mb/s
[   39.309190] Strings selftest: memcpy(src+1, dst+1): 260 Mb/s
[   39.349721] Strings selftest: memcpy(src+1, dst+2): 114 Mb/s
[...]
[   41.289423] Strings selftest: memcpy(src+7, dst+5): 114 Mb/s
[   41.328801] Strings selftest: memcpy(src+7, dst+6): 118 Mb/s
[   41.349907] Strings selftest: memcpy(src+7, dst+7): 259 Mb/s
[   41.377735] Strings selftest: memset(dst+0): 241 Mb/s
[   41.397882] Strings selftest: memset(dst+1): 265 Mb/s
[   41.417666] Strings selftest: memset(dst+2): 272 Mb/s
[   41.437169] Strings selftest: memset(dst+3): 277 Mb/s
[   41.456656] Strings selftest: memset(dst+4): 277 Mb/s
[   41.476125] Strings selftest: memset(dst+5): 278 Mb/s
[   41.495555] Strings selftest: memset(dst+6): 278 Mb/s
[   41.515002] Strings selftest: memset(dst+7): 278 Mb/s

Gary's:

[   27.438112] Strings selftest: memcpy(src+0, dst+0): 232 Mb/s
[   27.461586] Strings selftest: memcpy(src+0, dst+1): 224 Mb/s
[   27.484691] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   27.507693] Strings selftest: memcpy(src+0, dst+3): 230 Mb/s
[   27.530758] Strings selftest: memcpy(src+0, dst+4): 229 Mb/s
[   27.553840] Strings selftest: memcpy(src+0, dst+5): 229 Mb/s
[   27.576793] Strings selftest: memcpy(src+0, dst+6): 231 Mb/s
[   27.599862] Strings selftest: memcpy(src+0, dst+7): 230 Mb/s
[   27.622888] Strings selftest: memcpy(src+1, dst+0): 230 Mb/s
[   27.643964] Strings selftest: memcpy(src+1, dst+1): 259 Mb/s
[   27.666926] Strings selftest: memcpy(src+1, dst+2): 231 Mb/s
[...]
[   28.831726] Strings selftest: memcpy(src+7, dst+5): 230 Mb/s
[   28.854790] Strings selftest: memcpy(src+7, dst+6): 229 Mb/s
[   28.875844] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s
[   28.903666] Strings selftest: memset(dst+0): 240 Mb/s
[   28.923533] Strings selftest: memset(dst+1): 269 Mb/s
[   28.943100] Strings selftest: memset(dst+2): 275 Mb/s
[   28.962554] Strings selftest: memset(dst+3): 277 Mb/s
[   28.982009] Strings selftest: memset(dst+4): 277 Mb/s
[   29.001412] Strings selftest: memset(dst+5): 278 Mb/s
[   29.020894] Strings selftest: memset(dst+6): 277 Mb/s
[   29.040383] Strings selftest: memset(dst+7): 276 Mb/s

Mine:

[   33.916144] Strings selftest: memcpy(src+0, dst+0): 222 Mb/s
[   33.939520] Strings selftest: memcpy(src+0, dst+1): 226 Mb/s
[   33.962666] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   33.985749] Strings selftest: memcpy(src+0, dst+3): 229 Mb/s
[   34.008748] Strings selftest: memcpy(src+0, dst+4): 231 Mb/s
[   34.031970] Strings selftest: memcpy(src+0, dst+5): 228 Mb/s
[   34.055065] Strings selftest: memcpy(src+0, dst+6): 229 Mb/s
[   34.078068] Strings selftest: memcpy(src+0, dst+7): 231 Mb/s
[   34.101177] Strings selftest: memcpy(src+1, dst+0): 229 Mb/s
[   34.122995] Strings selftest: memcpy(src+1, dst+1): 247 Mb/s
[   34.146072] Strings selftest: memcpy(src+1, dst+2): 229 Mb/s
[...]
[   35.315594] Strings selftest: memcpy(src+7, dst+5): 229 Mb/s
[   35.338617] Strings selftest: memcpy(src+7, dst+6): 230 Mb/s
[   35.360464] Strings selftest: memcpy(src+7, dst+7): 247 Mb/s
[   35.388929] Strings selftest: memset(dst+0): 232 Mb/s
[   35.409351] Strings selftest: memset(dst+1): 260 Mb/s
[   35.429434] Strings selftest: memset(dst+2): 266 Mb/s
[   35.449460] Strings selftest: memset(dst+3): 267 Mb/s
[   35.469479] Strings selftest: memset(dst+4): 267 Mb/s
[   35.489481] Strings selftest: memset(dst+5): 268 Mb/s
[   35.509443] Strings selftest: memset(dst+6): 269 Mb/s
[   35.529449] Strings selftest: memset(dst+7): 268 Mb/s

Leaving out the first memcpy/set of every test, which is always slower
(maybe because of a cache miss?), the current implementation copies
260 Mb/s when the low order bits match, and 114 Mb/s otherwise.
Memset is stable at 278 Mb/s.

Gary's implementation is much faster: it still copies 260 Mb/s when
equally aligned, and 230 Mb/s otherwise. Memset is the same as the
current one.

Mine has the same speed as Gary's when the low order bits mismatch,
but it's slower when equally aligned, stopping at 247 Mb/s. Memset is
slightly slower at 269 Mb/s.

I'm not familiar with RISC-V assembly, but looking at Gary's assembly
I think he manually unrolled the loop, copying 16 uint64_t at a time
using 16 registers.
I managed to do the same with a small change in the C code and a pragma
directive.

This for memcpy():

	if (distance) {
		unsigned long last, next;
		int i;

		s.u8 -= distance;

		for (; count >= bytes_long * 8 + mask; count -= bytes_long * 8) {
			next = s.ulong[0];
			for (i = 0; i < 8; i++) {
				last = next;
				next = s.ulong[i + 1];
				d.ulong[i] = last >> (distance * 8) |
					     next << ((bytes_long - distance) * 8);
			}
			d.ulong += 8;
			s.ulong += 8;
		}

		s.u8 += distance;
	} else {
		/* 8 byte wide copy */
		int i;

		for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
			for (i = 0; i < 8; i++)
				d.ulong[i] = s.ulong[i];
			d.ulong += 8;
			s.ulong += 8;
		}
	}

And this for memset():

	for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
		for (i = 0; i < 8; i++)
			dest.ulong[i] = cu;
		dest.ulong += 8;
	}

And the generated machine code is very, very similar to Gary's!
And these are the results:

[   35.898366] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   35.920942] Strings selftest: memcpy(src+0, dst+1): 236 Mb/s
[   35.943171] Strings selftest: memcpy(src+0, dst+2): 241 Mb/s
[   35.965291] Strings selftest: memcpy(src+0, dst+3): 242 Mb/s
[   35.987374] Strings selftest: memcpy(src+0, dst+4): 244 Mb/s
[   36.009554] Strings selftest: memcpy(src+0, dst+5): 242 Mb/s
[   36.031721] Strings selftest: memcpy(src+0, dst+6): 242 Mb/s
[   36.053881] Strings selftest: memcpy(src+0, dst+7): 242 Mb/s
[   36.075949] Strings selftest: memcpy(src+1, dst+0): 243 Mb/s
[   36.097084] Strings selftest: memcpy(src+1, dst+1): 258 Mb/s
[   36.119269] Strings selftest: memcpy(src+1, dst+2): 242 Mb/s
[...]
[   37.242433] Strings selftest: memcpy(src+7, dst+5): 242 Mb/s
[   37.264571] Strings selftest: memcpy(src+7, dst+6): 242 Mb/s
[   37.285609] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s
[   37.313633] Strings selftest: memset(dst+0): 237 Mb/s
[   37.333682] Strings selftest: memset(dst+1): 266 Mb/s
[   37.353375] Strings selftest: memset(dst+2): 273 Mb/s
[   37.373000] Strings selftest: memset(dst+3): 274 Mb/s
[   37.392608] Strings selftest: memset(dst+4): 274 Mb/s
[   37.412220] Strings selftest: memset(dst+5): 274 Mb/s
[   37.431848] Strings selftest: memset(dst+6): 274 Mb/s
[   37.451467] Strings selftest: memset(dst+7): 274 Mb/s

This version is even faster than the assembly one, but it won't take the
fast path for copies/sets smaller than at least 64, or even 128, bytes.
With small buffers it will copy bytes one at a time, so I don't know if
it's worth it.

What is preferred in your opinion: an implementation which is always
fast with all sizes, or one which is a bit faster but slow with small
copies?

--
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread
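[One common answer to that closing question is to keep both paths and dispatch on size: copy small buffers with a plain byte loop, and enter the alignment prologue and unrolled body only past a threshold, so short copies don't pay the setup cost. A hedged sketch of that idea — the threshold value and names are illustrative and would need benchmarking on real hardware:]

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL_THRESHOLD 128	/* illustrative; tune per CPU */

void *dispatch_memcpy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (count < UNROLL_THRESHOLD) {
		/* small copy: simple byte loop, no alignment prologue */
		while (count--)
			*d++ = *s++;
		return dest;
	}

	/* large copy: align dest, then 8 words per iteration */
	for (; (uintptr_t)d & (sizeof(unsigned long) - 1); count--)
		*d++ = *s++;

	if (!((uintptr_t)s & (sizeof(unsigned long) - 1))) {
		unsigned long *dl = (unsigned long *)d;
		const unsigned long *sl = (const unsigned long *)s;

		for (; count >= 8 * sizeof(unsigned long);
		     count -= 8 * sizeof(unsigned long)) {
			for (int i = 0; i < 8; i++)
				dl[i] = sl[i];
			dl += 8;
			sl += 8;
		}
		d = (unsigned char *)dl;
		s = (const unsigned char *)sl;
	}

	/* tail, and the misaligned-src fallback */
	while (count--)
		*d++ = *s++;
	return dest;
}
```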
* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  2:01 ` Matteo Croce
@ 2021-06-16  8:24 ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-16 8:24 UTC (permalink / raw)
  To: 'Matteo Croce', Bin Meng
  Cc: Emil Renner Berthing, Gary Guo, linux-riscv, linux-kernel,
      linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
      Akira Tsukamoto, Drew Fustini

From: Matteo Croce
> Sent: 16 June 2021 03:02
...
> > > That's a good idea, but if you read the replies to Gary's original
> > > patch
> > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > ... both Gary, Palmer and David would rather like a C-based version.
> > > This is one attempt at providing that.
> >
> > Yep, I prefer C as well :)
> >
> > But if you check commit 04091d6, the assembly version was introduced
> > for KASAN. So if we are to change it back to C, please make sure KASAN
> > is not broken.
>
...
> Leaving out the first memcpy/set of every test, which is always slower
> (maybe because of a cache miss?), the current implementation copies
> 260 Mb/s when the low order bits match, and 114 Mb/s otherwise.
> Memset is stable at 278 Mb/s.
>
> Gary's implementation is much faster: it still copies 260 Mb/s when
> equally aligned, and 230 Mb/s otherwise. Memset is the same as the
> current one.

Any idea what the attainable performance is for the cpu you are using?

Since both memset and memcpy are running at much the same speed
I suspect it is all limited by the writes.

272MB/s is only 34M writes/sec.
This seems horribly slow for a modern cpu.
So is this actually really limited by the cache writes to physical memory?

You might want to do some tests (userspace is fine) where you check
much smaller lengths that definitely sit within the data cache.

It is also worth checking how much overhead there is for short copies -
they are almost certainly more common than you might expect.
This is one problem with excessive loop unrolling - the 'special cases' for the ends of the buffer start having a big effect on small copies. For CPUs that support misaligned memory accesses, one 'trick' for transfers longer than a 'word' is to do a (probably) misaligned transfer of the last word of the buffer first, followed by the transfer of the rest of the buffer (overlapping a few bytes at the end). This saves on conditionals and temporary values. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 64+ messages in thread
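David's overlapping-tail trick can be sketched as follows. It is only a win on hardware where misaligned accesses are cheap (not the U74 discussed in this thread, where they trap), and it assumes src and dest do not overlap; the memcpy-of-one-word idiom stands in for a single, possibly misaligned, load/store:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the (possibly misaligned) last word first, then copy the body in
 * whole words, deliberately overlapping a few bytes at the end. */
static void *overlap_copy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	unsigned long w;
	size_t off;

	if (count < sizeof(long)) {
		while (count--)
			*d++ = *s++;
		return dest;
	}

	/* Tail word first: no byte-loop or conditionals for the remainder. */
	memcpy(&w, s + count - sizeof(long), sizeof(long));
	memcpy(d + count - sizeof(long), &w, sizeof(long));

	/* Body: whole words; the last one may overlap the tail word,
	 * rewriting the same bytes. */
	for (off = 0; off + sizeof(long) < count; off += sizeof(long)) {
		memcpy(&w, s + off, sizeof(long));
		memcpy(d + off, &w, sizeof(long));
	}
	return dest;
}
```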
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-16 8:24 ` David Laight @ 2021-06-16 10:48 ` Akira Tsukamoto -1 siblings, 0 replies; 64+ messages in thread From: Akira Tsukamoto @ 2021-06-16 10:48 UTC (permalink / raw) To: David Laight Cc: Matteo Croce, Bin Meng, Emil Renner Berthing, Gary Guo, linux-riscv, linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Drew Fustini On Wed, Jun 16, 2021 at 5:24 PM David Laight <David.Laight@aculab.com> wrote: > > From: Matteo Croce > > Sent: 16 June 2021 03:02 > ... > > > > That's a good idea, but if you read the replies to Gary's original > > > > patch > > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/ > > > > .. both Gary, Palmer and David would rather like a C-based version. > > > > This is one attempt at providing that. > > > > > > Yep, I prefer C as well :) > > > > > > But if you check commit 04091d6, the assembly version was introduced > > > for KASAN. So if we are to change it back to C, please make sure KASAN > > > is not broken. > > > > ... > > Leaving out the first memcpy/set of every test which is always slower (maybe > > because of a cache miss?), the current implementation copies 260 Mb/s when > > the low order bits match, and 114 otherwise. > > Memset is stable at 278 Mb/s. > > > > Gary's implementation is much faster, copies still 260 Mb/s when equally placed, > > and 230 Mb/s otherwise. Memset is the same as the current one. > > Any idea what the attainable performance is for the CPU you are using? > Since both memset and memcpy are running at much the same speed > I suspect it is all limited by the writes. > > 272MB/s is only 34M writes/sec. > This seems horribly slow for a modern CPU. > So is this actually really limited by the cache writes to physical memory? > > You might want to do some tests (userspace is fine) where you > check much smaller lengths that definitely sit within the data cache. 
> > It is also worth checking how much overhead there is for > short copies - they are almost certainly more common than > you might expect. > This is one problem with excessive loop unrolling - the 'special > cases' for the ends of the buffer start having a big effect > on small copies. > > For CPUs that support misaligned memory accesses, one 'trick' > for transfers longer than a 'word' is to do a (probably) misaligned > transfer of the last word of the buffer first followed by the > transfer of the rest of the buffer (overlapping a few bytes at the end). > This saves on conditionals and temporary values. I am fine with Matteo's memcpy. The two culprits seen in the `perf top -Ue task-clock` output during the TCP and UDP network tests are > Overhead Shared O Symbol > 42.22% [kernel] [k] memcpy > 35.00% [kernel] [k] __asm_copy_to_user so we really need to optimize both memcpy and __asm_copy_to_user. The main reason for the speedup in memcpy is that > Gary's assembly version of memcpy improves things by not doing unaligned > accesses across a 64-bit boundary: it reads with aligned accesses at an > offset and shifts afterwards, because every misaligned access is trapped > and handled by OpenSBI in M-mode. The main speedup comes from avoiding the > S-mode (kernel) and M-mode (OpenSBI) switching. 
which are in the code: Gary's: + /* Calculate shifts */ + slli t3, a3, 3 + sub t4, x0, t3 /* negate is okay as shift will only look at LSBs */ + + /* Load the initial value and align a1 */ + andi a1, a1, ~(SZREG-1) + REG_L a5, 0(a1) + + addi t0, t0, -(SZREG-1) + /* At least one iteration will be executed here, no check */ +1: + srl a4, a5, t3 + REG_L a5, SZREG(a1) + addi a1, a1, SZREG + sll a2, a5, t4 + or a2, a2, a4 + REG_S a2, 0(a0) + addi a0, a0, SZREG + bltu a0, t0, 1b and Matteo ported to C: +#pragma GCC unroll 8 + for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) { + last = next; + next = s.ulong[1]; + + d.ulong[0] = last >> (distance * 8) | + next << ((bytes_long - distance) * 8); + + d.ulong++; + s.ulong++; + } I believe this is reasonable and enough to be in the upstream. Akira > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) > ^ permalink raw reply [flat|nested] 64+ messages in thread
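The shifted inner loop quoted above can be exercised in userspace. This sketch assumes a little-endian machine, a non-zero alignment distance, and a source buffer with at least one addressable aligned word past the last byte copied (the patch guarantees the latter through its count >= bytes_long + mask bound):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* d must be long-aligned and s8 must NOT be (distance != 0): reads are
 * done on aligned boundaries only and stitched together with shifts,
 * like the C loop quoted above.  Little-endian only. */
static void shift_copy(unsigned long *d, const unsigned char *s8, size_t nwords)
{
	const size_t bytes_long = sizeof(unsigned long);
	size_t distance = (uintptr_t)s8 % bytes_long; /* assumed != 0 */
	const unsigned long *s = (const unsigned long *)(s8 - distance);
	unsigned long last, next = s[0];
	size_t i;

	for (i = 0; i < nwords; i++) {
		last = next;
		next = s[i + 1]; /* reads one aligned word past the end */
		d[i] = last >> (distance * 8) |
		       next << ((bytes_long - distance) * 8);
	}
}
```

Every load and store above lands on a long boundary, which is exactly why neither version traps to M-mode even though the source pointer is misaligned.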
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-16 8:24 ` David Laight @ 2021-06-16 19:06 ` Matteo Croce -1 siblings, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-16 19:06 UTC (permalink / raw) To: David Laight Cc: Bin Meng, Emil Renner Berthing, Gary Guo, linux-riscv, linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Akira Tsukamoto, Drew Fustini On Wed, Jun 16, 2021 at 10:24 AM David Laight <David.Laight@aculab.com> wrote: > > From: Matteo Croce > > Sent: 16 June 2021 03:02 > ... > > > > That's a good idea, but if you read the replies to Gary's original > > > > patch > > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/ > > > > .. both Gary, Palmer and David would rather like a C-based version. > > > > This is one attempt at providing that. > > > > > > Yep, I prefer C as well :) > > > > > > But if you check commit 04091d6, the assembly version was introduced > > > for KASAN. So if we are to change it back to C, please make sure KASAN > > > is not broken. > > > > ... > > Leaving out the first memcpy/set of every test which is always slower (maybe > > because of a cache miss?), the current implementation copies 260 Mb/s when > > the low order bits match, and 114 otherwise. > > Memset is stable at 278 Mb/s. > > > > Gary's implementation is much faster, copies still 260 Mb/s when equally placed, > > and 230 Mb/s otherwise. Memset is the same as the current one. > > Any idea what the attainable performance is for the CPU you are using? > Since both memset and memcpy are running at much the same speed > I suspect it is all limited by the writes. > > 272MB/s is only 34M writes/sec. > This seems horribly slow for a modern CPU. > So is this actually really limited by the cache writes to physical memory? > > You might want to do some tests (userspace is fine) where you > check much smaller lengths that definitely sit within the data cache. 
> I get similar results in userspace; this tool writes to RAM with a variable data width: root@beaglev:~/src# ./unalign_check 1 0 1 size: 1 Mb write size: 8 bit unalignment: 0 byte elapsed time: 0.01 sec throughput: 124.36 Mb/s # ./unalign_check 1 0 8 size: 1 Mb write size: 64 bit unalignment: 0 byte elapsed time: 0.00 sec throughput: 252.12 Mb/s > It is also worth checking how much overhead there is for > short copies - they are almost certainly more common than > you might expect. > This is one problem with excessive loop unrolling - the 'special > cases' for the ends of the buffer start having a big effect > on small copies. > I too believe that they are much more common than long ones. Indeed, I wish to reduce the MIN_THRESHOLD value from 64 to 32 or even 16. Or have it depend on the word size, e.g. sizeof(long) * 2. Suggestions? > For CPUs that support misaligned memory accesses, one 'trick' > for transfers longer than a 'word' is to do a (probably) misaligned > transfer of the last word of the buffer first followed by the > transfer of the rest of the buffer (overlapping a few bytes at the end). > This saves on conditionals and temporary values. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) > Regards, -- per aspera ad upstream ^ permalink raw reply [flat|nested] 64+ messages in thread
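The source of the unalign_check tool quoted above isn't shown in the thread, so the following is only a guess at its shape: a minimal userspace harness that fills a buffer with stores of a chosen width and times it, which is enough to reproduce the 8-bit vs 64-bit write-throughput comparison:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Fill `size` bytes using stores of the given width (8 or 64 bits) and
 * return the elapsed time in seconds.  `volatile` keeps the compiler
 * from widening the byte loop behind our back. */
static double fill(void *buf, size_t size, int width_bits)
{
	struct timespec t0, t1;
	size_t i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (width_bits == 8) {
		volatile uint8_t *p = buf;
		for (i = 0; i < size; i++)
			p[i] = 0xa5;
	} else { /* width_bits == 64 */
		volatile uint64_t *p = buf;
		for (i = 0; i < size / sizeof(uint64_t); i++)
			p[i] = 0xa5a5a5a5a5a5a5a5ULL;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Dividing the buffer size by the returned time gives the MB/s figures of the kind quoted above; the ratio between the two widths shows how much the store width alone matters on a given core.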
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-15 13:18 ` David Laight @ 2021-06-15 13:44 ` Matteo Croce -1 siblings, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-15 13:44 UTC (permalink / raw) To: David Laight Cc: Bin Meng, linux-riscv, linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini On Tue, Jun 15, 2021 at 3:18 PM David Laight <David.Laight@aculab.com> wrote: > > From: Bin Meng > > Sent: 15 June 2021 14:09 > > > > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote: > > > > ... > > > I'm surprised that the C loop: > > > > > > > + for (; count >= bytes_long; count -= bytes_long) > > > > + *d.ulong++ = *s.ulong++; > > > > > > ends up being faster than the ASM 'read lots' - 'write lots' loop. > > > > I believe that's because the assembly version has some unaligned > > access cases, which end up being trap-n-emulated in the OpenSBI > > firmware, and that is a big overhead. > > Ah, that would make sense since the asm user copy code > was broken for misaligned copies. > I suspect memcpy() was broken the same way. > > I'm surprised NET_IP_ALIGN isn't set to 2 to try to > avoid all these misaligned copies in the network stack. > Although avoiding 8n+4 aligned data is rather harder. > That's up to the network driver; indeed, I already have a patch for the BeagleV one: https://lore.kernel.org/netdev/20210615012107.577ead86@linux.microsoft.com/T/ > Misaligned copies are just best avoided - really even on x86. > The 'real fun' is when the access crosses TLB boundaries. > -- per aspera ad upstream ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-15 2:38 ` Matteo Croce @ 2021-06-16 11:46 ` Guo Ren -1 siblings, 0 replies; 64+ messages in thread From: Guo Ren @ 2021-06-16 11:46 UTC (permalink / raw) To: Matteo Croce Cc: linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng Hi Matteo, Have you tried the Glibc generic implementation? ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t If the Glibc code has the same performance on your hardware, then you could contribute a generic implementation first. The current Linux generic implementation in lib/string.c is very simple: #ifndef __HAVE_ARCH_MEMCPY /** * memcpy - Copy one area of memory to another * @dest: Where to copy to * @src: Where to copy from * @count: The size of the area. * * You should not use this function to access IO space, use memcpy_toio() * or memcpy_fromio() instead. */ void *memcpy(void *dest, const void *src, size_t count) { char *tmp = dest; const char *s = src; while (count--) *tmp++ = *s++; return dest; } EXPORT_SYMBOL(memcpy); #endif On Tue, Jun 15, 2021 at 10:42 AM Matteo Croce <mcroce@linux.microsoft.com> wrote: > > From: Matteo Croce <mcroce@microsoft.com> > > Write a C version of memcpy() which uses the biggest data size allowed, > without generating unaligned accesses. > > The procedure is made of three steps: > First, copy data one byte at a time until the destination buffer is aligned > to a long boundary. > Then copy the data one long at a time, shifting the current and the next long > to compose a long at every cycle. > Finally, copy the remainder one byte at a time. 
> > On a BeagleV, the TCP RX throughput increased by 45%: > > before: > > $ iperf3 -c beaglev > Connecting to host beaglev, port 5201 > [ 5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 76.4 MBytes 641 Mbits/sec 27 624 KBytes > [ 5] 1.00-2.00 sec 72.5 MBytes 608 Mbits/sec 0 708 KBytes > [ 5] 2.00-3.00 sec 73.8 MBytes 619 Mbits/sec 10 451 KBytes > [ 5] 3.00-4.00 sec 72.5 MBytes 608 Mbits/sec 0 564 KBytes > [ 5] 4.00-5.00 sec 73.8 MBytes 619 Mbits/sec 0 658 KBytes > [ 5] 5.00-6.00 sec 73.8 MBytes 619 Mbits/sec 14 522 KBytes > [ 5] 6.00-7.00 sec 73.8 MBytes 619 Mbits/sec 0 621 KBytes > [ 5] 7.00-8.00 sec 72.5 MBytes 608 Mbits/sec 0 706 KBytes > [ 5] 8.00-9.00 sec 73.8 MBytes 619 Mbits/sec 20 580 KBytes > [ 5] 9.00-10.00 sec 73.8 MBytes 619 Mbits/sec 0 672 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 736 MBytes 618 Mbits/sec 71 sender > [ 5] 0.00-10.01 sec 733 MBytes 615 Mbits/sec receiver > > after: > > $ iperf3 -c beaglev > Connecting to host beaglev, port 5201 > [ 5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 109 MBytes 912 Mbits/sec 48 559 KBytes > [ 5] 1.00-2.00 sec 108 MBytes 902 Mbits/sec 0 690 KBytes > [ 5] 2.00-3.00 sec 106 MBytes 891 Mbits/sec 36 396 KBytes > [ 5] 3.00-4.00 sec 108 MBytes 902 Mbits/sec 0 567 KBytes > [ 5] 4.00-5.00 sec 106 MBytes 891 Mbits/sec 0 699 KBytes > [ 5] 5.00-6.00 sec 106 MBytes 891 Mbits/sec 32 414 KBytes > [ 5] 6.00-7.00 sec 106 MBytes 891 Mbits/sec 0 583 KBytes > [ 5] 7.00-8.00 sec 106 MBytes 891 Mbits/sec 0 708 KBytes > [ 5] 8.00-9.00 sec 106 MBytes 891 Mbits/sec 28 433 KBytes > [ 5] 9.00-10.00 sec 108 MBytes 902 Mbits/sec 0 591 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 1.04 GBytes 897 Mbits/sec 144 sender 
> [ 5] 0.00-10.01 sec 1.04 GBytes 894 Mbits/sec receiver > > And the decreased CPU time of the memcpy() is observable with perf top. > This is the `perf top -Ue task-clock` output when doing the test: > > before: > > Overhead Shared O Symbol > 42.22% [kernel] [k] memcpy > 35.00% [kernel] [k] __asm_copy_to_user > 3.50% [kernel] [k] sifive_l2_flush64_range > 2.30% [kernel] [k] stmmac_napi_poll_rx > 1.11% [kernel] [k] memset > > after: > > Overhead Shared O Symbol > 45.69% [kernel] [k] __asm_copy_to_user > 29.06% [kernel] [k] memcpy > 4.09% [kernel] [k] sifive_l2_flush64_range > 2.77% [kernel] [k] stmmac_napi_poll_rx > 1.24% [kernel] [k] memset > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > --- > arch/riscv/include/asm/string.h | 8 ++- > arch/riscv/kernel/riscv_ksyms.c | 2 - > arch/riscv/lib/Makefile | 2 +- > arch/riscv/lib/memcpy.S | 108 -------------------------------- > arch/riscv/lib/string.c | 94 +++++++++++++++++++++++++++ > 5 files changed, 101 insertions(+), 113 deletions(-) > delete mode 100644 arch/riscv/lib/memcpy.S > create mode 100644 arch/riscv/lib/string.c > > diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h > index 909049366555..6b5d6fc3eab4 100644 > --- a/arch/riscv/include/asm/string.h > +++ b/arch/riscv/include/asm/string.h > @@ -12,9 +12,13 @@ > #define __HAVE_ARCH_MEMSET > extern asmlinkage void *memset(void *, int, size_t); > extern asmlinkage void *__memset(void *, int, size_t); > + > +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE > #define __HAVE_ARCH_MEMCPY > -extern asmlinkage void *memcpy(void *, const void *, size_t); > -extern asmlinkage void *__memcpy(void *, const void *, size_t); > +extern void *memcpy(void *dest, const void *src, size_t count); > +extern void *__memcpy(void *dest, const void *src, size_t count); > +#endif > + > #define __HAVE_ARCH_MEMMOVE > extern asmlinkage void *memmove(void *, const void *, size_t); > extern asmlinkage void *__memmove(void *, const void *, size_t); > diff --git 
* Re: [PATCH 1/3] riscv: optimized memcpy @ 2021-06-16 11:46 ` Guo Ren 0 siblings, 0 replies; 64+ messages in thread
From: Guo Ren @ 2021-06-16 11:46 UTC (permalink / raw)
To: Matteo Croce
Cc: linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

Hi Matteo,

Have you tried the Glibc generic implementation code?
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

If the Glibc code has the same performance on your hardware, then you could propose a generic implementation first. The current Linux generic implementation in lib/string.c is very simple:

#ifndef __HAVE_ARCH_MEMCPY
/**
 * memcpy - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * You should not use this function to access IO space, use memcpy_toio()
 * or memcpy_fromio() instead.
 */
void *memcpy(void *dest, const void *src, size_t count)
{
	char *tmp = dest;
	const char *s = src;

	while (count--)
		*tmp++ = *s++;
	return dest;
}
EXPORT_SYMBOL(memcpy);
#endif

On Tue, Jun 15, 2021 at 10:42 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
>
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> > On a BeagleV, the TCP RX throughput increased by 45%: > > before: > > $ iperf3 -c beaglev > Connecting to host beaglev, port 5201 > [ 5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 76.4 MBytes 641 Mbits/sec 27 624 KBytes > [ 5] 1.00-2.00 sec 72.5 MBytes 608 Mbits/sec 0 708 KBytes > [ 5] 2.00-3.00 sec 73.8 MBytes 619 Mbits/sec 10 451 KBytes > [ 5] 3.00-4.00 sec 72.5 MBytes 608 Mbits/sec 0 564 KBytes > [ 5] 4.00-5.00 sec 73.8 MBytes 619 Mbits/sec 0 658 KBytes > [ 5] 5.00-6.00 sec 73.8 MBytes 619 Mbits/sec 14 522 KBytes > [ 5] 6.00-7.00 sec 73.8 MBytes 619 Mbits/sec 0 621 KBytes > [ 5] 7.00-8.00 sec 72.5 MBytes 608 Mbits/sec 0 706 KBytes > [ 5] 8.00-9.00 sec 73.8 MBytes 619 Mbits/sec 20 580 KBytes > [ 5] 9.00-10.00 sec 73.8 MBytes 619 Mbits/sec 0 672 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 736 MBytes 618 Mbits/sec 71 sender > [ 5] 0.00-10.01 sec 733 MBytes 615 Mbits/sec receiver > > after: > > $ iperf3 -c beaglev > Connecting to host beaglev, port 5201 > [ 5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 109 MBytes 912 Mbits/sec 48 559 KBytes > [ 5] 1.00-2.00 sec 108 MBytes 902 Mbits/sec 0 690 KBytes > [ 5] 2.00-3.00 sec 106 MBytes 891 Mbits/sec 36 396 KBytes > [ 5] 3.00-4.00 sec 108 MBytes 902 Mbits/sec 0 567 KBytes > [ 5] 4.00-5.00 sec 106 MBytes 891 Mbits/sec 0 699 KBytes > [ 5] 5.00-6.00 sec 106 MBytes 891 Mbits/sec 32 414 KBytes > [ 5] 6.00-7.00 sec 106 MBytes 891 Mbits/sec 0 583 KBytes > [ 5] 7.00-8.00 sec 106 MBytes 891 Mbits/sec 0 708 KBytes > [ 5] 8.00-9.00 sec 106 MBytes 891 Mbits/sec 28 433 KBytes > [ 5] 9.00-10.00 sec 108 MBytes 902 Mbits/sec 0 591 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 1.04 GBytes 897 Mbits/sec 144 sender 
> [ 5] 0.00-10.01 sec 1.04 GBytes 894 Mbits/sec receiver > > And the decreased CPU time of the memcpy() is observable with perf top. > This is the `perf top -Ue task-clock` output when doing the test: > > before: > > Overhead Shared O Symbol > 42.22% [kernel] [k] memcpy > 35.00% [kernel] [k] __asm_copy_to_user > 3.50% [kernel] [k] sifive_l2_flush64_range > 2.30% [kernel] [k] stmmac_napi_poll_rx > 1.11% [kernel] [k] memset > > after: > > Overhead Shared O Symbol > 45.69% [kernel] [k] __asm_copy_to_user > 29.06% [kernel] [k] memcpy > 4.09% [kernel] [k] sifive_l2_flush64_range > 2.77% [kernel] [k] stmmac_napi_poll_rx > 1.24% [kernel] [k] memset > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > --- > arch/riscv/include/asm/string.h | 8 ++- > arch/riscv/kernel/riscv_ksyms.c | 2 - > arch/riscv/lib/Makefile | 2 +- > arch/riscv/lib/memcpy.S | 108 -------------------------------- > arch/riscv/lib/string.c | 94 +++++++++++++++++++++++++++ > 5 files changed, 101 insertions(+), 113 deletions(-) > delete mode 100644 arch/riscv/lib/memcpy.S > create mode 100644 arch/riscv/lib/string.c > > diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h > index 909049366555..6b5d6fc3eab4 100644 > --- a/arch/riscv/include/asm/string.h > +++ b/arch/riscv/include/asm/string.h > @@ -12,9 +12,13 @@ > #define __HAVE_ARCH_MEMSET > extern asmlinkage void *memset(void *, int, size_t); > extern asmlinkage void *__memset(void *, int, size_t); > + > +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE > #define __HAVE_ARCH_MEMCPY > -extern asmlinkage void *memcpy(void *, const void *, size_t); > -extern asmlinkage void *__memcpy(void *, const void *, size_t); > +extern void *memcpy(void *dest, const void *src, size_t count); > +extern void *__memcpy(void *dest, const void *src, size_t count); > +#endif > + > #define __HAVE_ARCH_MEMMOVE > extern asmlinkage void *memmove(void *, const void *, size_t); > extern asmlinkage void *__memmove(void *, const void *, size_t); > diff --git 
a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c > index 5ab1c7e1a6ed..3f6d512a5b97 100644 > --- a/arch/riscv/kernel/riscv_ksyms.c > +++ b/arch/riscv/kernel/riscv_ksyms.c > @@ -10,8 +10,6 @@ > * Assembly functions that may be used (directly or indirectly) by modules > */ > EXPORT_SYMBOL(memset); > -EXPORT_SYMBOL(memcpy); > EXPORT_SYMBOL(memmove); > EXPORT_SYMBOL(__memset); > -EXPORT_SYMBOL(__memcpy); > EXPORT_SYMBOL(__memmove); > diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile > index 25d5c9664e57..2ffe85d4baee 100644 > --- a/arch/riscv/lib/Makefile > +++ b/arch/riscv/lib/Makefile > @@ -1,9 +1,9 @@ > # SPDX-License-Identifier: GPL-2.0-only > lib-y += delay.o > -lib-y += memcpy.o > lib-y += memset.o > lib-y += memmove.o > lib-$(CONFIG_MMU) += uaccess.o > lib-$(CONFIG_64BIT) += tishift.o > +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o > > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o > diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S > deleted file mode 100644 > index 51ab716253fa..000000000000 > --- a/arch/riscv/lib/memcpy.S > +++ /dev/null > @@ -1,108 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0-only */ > -/* > - * Copyright (C) 2013 Regents of the University of California > - */ > - > -#include <linux/linkage.h> > -#include <asm/asm.h> > - > -/* void *memcpy(void *, const void *, size_t) */ > -ENTRY(__memcpy) > -WEAK(memcpy) > - move t6, a0 /* Preserve return value */ > - > - /* Defer to byte-oriented copy for small sizes */ > - sltiu a3, a2, 128 > - bnez a3, 4f > - /* Use word-oriented copy only if low-order bits match */ > - andi a3, t6, SZREG-1 > - andi a4, a1, SZREG-1 > - bne a3, a4, 4f > - > - beqz a3, 2f /* Skip if already aligned */ > - /* > - * Round to nearest double word-aligned address > - * greater than or equal to start address > - */ > - andi a3, a1, ~(SZREG-1) > - addi a3, a3, SZREG > - /* Handle initial misalignment */ > - sub a4, a3, a1 > -1: > - lb a5, 0(a1) > - addi a1, a1, 1 
> - sb a5, 0(t6) > - addi t6, t6, 1 > - bltu a1, a3, 1b > - sub a2, a2, a4 /* Update count */ > - > -2: > - andi a4, a2, ~((16*SZREG)-1) > - beqz a4, 4f > - add a3, a1, a4 > -3: > - REG_L a4, 0(a1) > - REG_L a5, SZREG(a1) > - REG_L a6, 2*SZREG(a1) > - REG_L a7, 3*SZREG(a1) > - REG_L t0, 4*SZREG(a1) > - REG_L t1, 5*SZREG(a1) > - REG_L t2, 6*SZREG(a1) > - REG_L t3, 7*SZREG(a1) > - REG_L t4, 8*SZREG(a1) > - REG_L t5, 9*SZREG(a1) > - REG_S a4, 0(t6) > - REG_S a5, SZREG(t6) > - REG_S a6, 2*SZREG(t6) > - REG_S a7, 3*SZREG(t6) > - REG_S t0, 4*SZREG(t6) > - REG_S t1, 5*SZREG(t6) > - REG_S t2, 6*SZREG(t6) > - REG_S t3, 7*SZREG(t6) > - REG_S t4, 8*SZREG(t6) > - REG_S t5, 9*SZREG(t6) > - REG_L a4, 10*SZREG(a1) > - REG_L a5, 11*SZREG(a1) > - REG_L a6, 12*SZREG(a1) > - REG_L a7, 13*SZREG(a1) > - REG_L t0, 14*SZREG(a1) > - REG_L t1, 15*SZREG(a1) > - addi a1, a1, 16*SZREG > - REG_S a4, 10*SZREG(t6) > - REG_S a5, 11*SZREG(t6) > - REG_S a6, 12*SZREG(t6) > - REG_S a7, 13*SZREG(t6) > - REG_S t0, 14*SZREG(t6) > - REG_S t1, 15*SZREG(t6) > - addi t6, t6, 16*SZREG > - bltu a1, a3, 3b > - andi a2, a2, (16*SZREG)-1 /* Update count */ > - > -4: > - /* Handle trailing misalignment */ > - beqz a2, 6f > - add a3, a1, a2 > - > - /* Use word-oriented copy if co-aligned to word boundary */ > - or a5, a1, t6 > - or a5, a5, a3 > - andi a5, a5, 3 > - bnez a5, 5f > -7: > - lw a4, 0(a1) > - addi a1, a1, 4 > - sw a4, 0(t6) > - addi t6, t6, 4 > - bltu a1, a3, 7b > - > - ret > - > -5: > - lb a4, 0(a1) > - addi a1, a1, 1 > - sb a4, 0(t6) > - addi t6, t6, 1 > - bltu a1, a3, 5b > -6: > - ret > -END(__memcpy) > diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c > new file mode 100644 > index 000000000000..525f9ee25a74 > --- /dev/null > +++ b/arch/riscv/lib/string.c > @@ -0,0 +1,94 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * String functions optimized for hardware which doesn't > + * handle unaligned memory accesses efficiently. 
> + * > + * Copyright (C) 2021 Matteo Croce > + */ > + > +#include <linux/types.h> > +#include <linux/module.h> > + > +/* size below a classic byte at time copy is done */ > +#define MIN_THRESHOLD 64 > + > +/* convenience types to avoid cast between different pointer types */ > +union types { > + u8 *u8; > + unsigned long *ulong; > + uintptr_t uptr; > +}; > + > +union const_types { > + const u8 *u8; > + unsigned long *ulong; > +}; > + > +void *memcpy(void *dest, const void *src, size_t count) > +{ > + const int bytes_long = BITS_PER_LONG / 8; > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + const int mask = bytes_long - 1; > + const int distance = (src - dest) & mask; > +#endif > + union const_types s = { .u8 = src }; > + union types d = { .u8 = dest }; > + > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + if (count <= MIN_THRESHOLD) > + goto copy_remainder; > + > + /* copy a byte at time until destination is aligned */ > + for (; count && d.uptr & mask; count--) > + *d.u8++ = *s.u8++; > + > + if (distance) { > + unsigned long last, next; > + > + /* move s backward to the previous alignment boundary */ > + s.u8 -= distance; > + > + /* 32/64 bit wide copy from s to d. > + * d is aligned now but s is not, so read s alignment wise, > + * and do proper shift to get the right value. > + * Works only on Little Endian machines. > + */ > + for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) { > + last = next; > + next = s.ulong[1]; > + > + d.ulong[0] = last >> (distance * 8) | > + next << ((bytes_long - distance) * 8); > + > + d.ulong++; > + s.ulong++; > + } > + > + /* restore s with the original offset */ > + s.u8 += distance; > + } else > +#endif > + { > + /* if the source and dest lower bits are the same, do a simple > + * 32/64 bit wide copy. 
> + */ > + for (; count >= bytes_long; count -= bytes_long) > + *d.ulong++ = *s.ulong++; > + } > + > + /* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */ > + goto copy_remainder; > + > +copy_remainder: > + while (count--) > + *d.u8++ = *s.u8++; > + > + return dest; > +} > +EXPORT_SYMBOL(memcpy); > + > +void *__memcpy(void *dest, const void *src, size_t count) > +{ > + return memcpy(dest, src, count); > +} > +EXPORT_SYMBOL(__memcpy); > -- > 2.31.1 > -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 64+ messages in thread
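The three-step procedure described in the quoted changelog (byte-copy until the destination is long-aligned, word-wise copy with shifts while the source remains misaligned, then a byte-wise tail) can be sketched in plain C along the following lines. This is an illustrative little-endian-only sketch with invented names, not the kernel code itself, and it glosses over the strict-aliasing concerns that the kernel sidesteps by building with -fno-strict-aliasing:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of the copy strategy: keep copying word-wise even when
 * the source is misaligned, by reading aligned words around it and
 * recombining them with shifts. Little-endian only; names invented here.
 * Note: like the kernel version, this may read up to wsz-1 bytes before
 * src, down to the previous alignment boundary - safe only because that
 * read stays within the same page. */
static void *copy_misaligned(void *dest, const void *src, size_t count)
{
	const size_t wsz = sizeof(unsigned long);
	unsigned char *d = dest;
	const unsigned char *s = src;
	size_t distance;

	/* step 1: copy a byte at a time until the destination is aligned */
	for (; count && ((uintptr_t)d & (wsz - 1)); count--)
		*d++ = *s++;

	distance = (uintptr_t)s & (wsz - 1);

	if (distance && count >= 2 * wsz) {
		/* step 2a: source still misaligned, so read the aligned
		 * words around it and shift the two halves into place */
		const unsigned long *sw = (const unsigned long *)(s - distance);
		unsigned long *dw = (unsigned long *)d;
		unsigned long last = *sw++, next;
		const size_t shift = distance * 8;

		while (count >= wsz + (wsz - 1)) {
			next = *sw++;
			*dw++ = (last >> shift) | (next << (wsz * 8 - shift));
			last = next;
			count -= wsz;
		}
		d = (unsigned char *)dw;
		s = (const unsigned char *)sw - (wsz - distance);
	} else if (!distance) {
		/* step 2b: both pointers aligned, plain word-wise copy */
		unsigned long *dw = (unsigned long *)d;
		const unsigned long *sw = (const unsigned long *)s;

		for (; count >= wsz; count -= wsz)
			*dw++ = *sw++;
		d = (unsigned char *)dw;
		s = (const unsigned char *)sw;
	}

	/* step 3: copy the remainder a byte at a time */
	while (count--)
		*d++ = *s++;
	return dest;
}
```

The actual patch additionally keeps a MIN_THRESHOLD below which everything is byte-copied, and only compiles the shifting path when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set.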
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-16 11:46 ` Guo Ren @ 2021-06-16 18:52 ` Matteo Croce -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-16 18:52 UTC (permalink / raw)
To: Guo Ren
Cc: linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
>
> Hi Matteo,
>
> Have you tried Glibc generic implementation code?
> ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
>
> If Glibc codes have the same performance in your hardware, then you
> could give a generic implementation first.
>

Hi,

I had a look; it seems to be a C unrolled version with the 'register' keyword. The same one was already merged in nios2:
https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar result to the other versions:

[ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

Regards,
--
per aspera ad upstream

^ permalink raw reply [flat|nested] 64+ messages in thread
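For readers following along, the glibc routine named here is an aligned word-at-a-time copy loop. A heavily simplified sketch of its shape follows (4-way unrolled here, where glibc itself uses an 8-way, switch-dispatched loop; this is illustrative, not the glibc source):

```c
#include <stddef.h>

/* Simplified sketch of glibc's _wordcopy_fwd_aligned(): copy nwords machine
 * words between already-aligned buffers, unrolled so that the loop-control
 * overhead is amortized over several words. Illustrative only. */
static void wordcopy_fwd_aligned(unsigned long *dst, const unsigned long *src,
				 size_t nwords)
{
	for (; nwords >= 4; nwords -= 4, src += 4, dst += 4) {
		/* issue all loads before the stores, which helps simple
		 * in-order pipelines hide the load-use latency */
		unsigned long a0 = src[0];
		unsigned long a1 = src[1];
		unsigned long a2 = src[2];
		unsigned long a3 = src[3];

		dst[0] = a0;
		dst[1] = a1;
		dst[2] = a2;
		dst[3] = a3;
	}
	while (nwords--)
		*dst++ = *src++;
}
```

The unrolling amortizes the loop-control instructions over several words, which is exactly the effect discussed for in-order cores later in the thread.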
* RE: [PATCH 1/3] riscv: optimized memcpy 2021-06-16 18:52 ` Matteo Croce @ 2021-06-17 21:30 ` David Laight -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-17 21:30 UTC (permalink / raw)
To: 'Matteo Croce', Guo Ren
Cc: linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 16 June 2021 19:52
> To: Guo Ren <guoren@kernel.org>
>
> On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> >
> > Hi Matteo,
> >
> > Have you tried Glibc generic implementation code?
> > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> >
> > If Glibc codes have the same performance in your hardware, then you
> > could give a generic implementation first.

Isn't that a byte copy loop - the performance of that ought to be terrible.

...

> I had a look, it seems that it's a C unrolled version with the
> 'register' keyword.
> The same one was already merged in nios2:
> https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I know a lot about the nios2 instruction timings.
(I've looked at code execution in the fpga's intel 'logic analyser'.)
It is a very simple 4-clock pipeline cpu with a 2-clock delay
before a value read from 'tightly coupled memory' (aka cache)
can be used in another instruction.
There is also a subtle pipeline stall if a read follows a write
to the same memory block, because the write is executed one
clock later - and would collide with the read.
Since it only ever executes one instruction per clock, loop
unrolling does help - since you never get the loop control 'for free'.
OTOH you don't need to use that many registers.
But an unrolled loop should approach 2 bytes/clock (32bit cpu).

> I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> result of the other versions:
>
> [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

What clock speed is that running at?
It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).

While the small riscv cpu might be similar to the nios2 (and mips
for that matter), there are also bigger/faster cpu.
I'm sure these can execute multiple instructions/clock
and possibly even read and write at the same time.
Unless they also support significant instruction re-ordering
the trivial copy loops are going to be slow on such cpu.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-17 21:30 ` David Laight @ 2021-06-17 21:48 ` Matteo Croce -1 siblings, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-17 21:48 UTC (permalink / raw) To: David Laight Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote: > > From: Matteo Croce > > Sent: 16 June 2021 19:52 > > To: Guo Ren <guoren@kernel.org> > > > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote: > > > > > > Hi Matteo, > > > > > > Have you tried Glibc generic implementation code? > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9- > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t > > > > > > If Glibc codes have the same performance in your hardware, then you > > > could give a generic implementation first. > > Isn't that a byte copy loop - the performance of that ought to be terrible. > ... > > > I had a look, it seems that it's a C unrolled version with the > > 'register' keyword. > > The same one was already merged in nios2: > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68 > > I know a lot about the nios2 instruction timings. > (I've looked at code execution in the fpga's intel 'logic analiser.) > It is a very simple 4-clock pipeline cpu with a 2-clock delay > before a value read from 'tightly coupled memory' (aka cache) > can be used in another instruction. > There is also a subtle pipeline stall if a read follows a write > to the same memory block because the write is executed one > clock later - and would collide with the read. > Since it only ever executes one instruction per clock loop > unrolling does help - since you never get the loop control 'for free'. > OTOH you don't need to use that many registers. 
> But an unrolled loop should approach 2 bytes/clock (32bit cpu).
>
> > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > result of the other versions:
> >
> > [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
>
> What clock speed is that running at?
> It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
>
> While the small riscv cpu might be similar to the nios2 (and mips
> for that matter), there are also bigger/faster cpu.
> I'm sure these can execute multiple instructions/clock
> and possible even read and write at the same time.
> Unless they also support significant instruction re-ordering
> the trivial copy loops are going to be slow on such cpu.
>

It's running at 1 GHz.

I get 257 Mb/s with a memcpy, a bit more with a memset, but I get 1200 Mb/s with a cycle which just reads memory with 64 bit addressing.

--
per aspera ad upstream

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-17 21:48 ` Matteo Croce @ 2021-06-18 0:32 ` Matteo Croce -1 siblings, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-18 0:32 UTC (permalink / raw) To: David Laight Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce <mcroce@linux.microsoft.com> wrote: > > On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote: > > > > From: Matteo Croce > > > Sent: 16 June 2021 19:52 > > > To: Guo Ren <guoren@kernel.org> > > > > > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote: > > > > > > > > Hi Matteo, > > > > > > > > Have you tried Glibc generic implementation code? > > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9- > > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t > > > > > > > > If Glibc codes have the same performance in your hardware, then you > > > > could give a generic implementation first. > > > > Isn't that a byte copy loop - the performance of that ought to be terrible. > > ... > > > > > I had a look, it seems that it's a C unrolled version with the > > > 'register' keyword. > > > The same one was already merged in nios2: > > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68 > > > > I know a lot about the nios2 instruction timings. > > (I've looked at code execution in the fpga's intel 'logic analiser.) > > It is a very simple 4-clock pipeline cpu with a 2-clock delay > > before a value read from 'tightly coupled memory' (aka cache) > > can be used in another instruction. > > There is also a subtle pipeline stall if a read follows a write > > to the same memory block because the write is executed one > > clock later - and would collide with the read. 
> > Since it only ever executes one instruction per clock loop
> > unrolling does help - since you never get the loop control 'for free'.
> > OTOH you don't need to use that many registers.
> > But an unrolled loop should approach 2 bytes/clock (32bit cpu).
> >
> > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > > result of the other versions:
> > >
> > > [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
> >
> > What clock speed is that running at?
> > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
> >
> > While the small riscv cpu might be similar to the nios2 (and mips
> > for that matter), there are also bigger/faster cpu.
> > I'm sure these can execute multiple instructions/clock
> > and possible even read and write at the same time.
> > Unless they also support significant instruction re-ordering
> > the trivial copy loops are going to be slow on such cpu.
> >
>
> It's running at 1 GHz.
>
> I get 257 Mb/s with a memcpy, a bit more with a memset,
> but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
>

Err, I forgot to mlock() the buffer before accessing the memory in userspace. The real speed here is:

8 bit read: 155.42 Mb/s
64 bit read: 277.29 Mb/s
8 bit write: 138.57 Mb/s
64 bit write: 239.21 Mb/s

--
per aspera ad upstream

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 1/3] riscv: optimized memcpy 2021-06-18 0:32 ` Matteo Croce @ 2021-06-18 1:05 ` Matteo Croce 1 sibling, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-18 1:05 UTC (permalink / raw) To: David Laight Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng On Fri, Jun 18, 2021 at 2:32 AM Matteo Croce <mcroce@linux.microsoft.com> wrote: > > On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce > <mcroce@linux.microsoft.com> wrote: > > > > On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote: > > > > > > From: Matteo Croce > > > > Sent: 16 June 2021 19:52 > > > > To: Guo Ren <guoren@kernel.org> > > > > > > > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote: > > > > > > > > > > Hi Matteo, > > > > > > > > > > Have you tried Glibc generic implementation code? > > > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9- > > > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t > > > > > > > > > > If Glibc codes have the same performance in your hardware, then you > > > > > could give a generic implementation first. > > > > > > Isn't that a byte copy loop - the performance of that ought to be terrible. > > > ... > > > > > > > I had a look, it seems that it's a C unrolled version with the > > > > 'register' keyword. > > > > The same one was already merged in nios2: > > > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68 > > > > > > I know a lot about the nios2 instruction timings. > > > (I've looked at code execution in the fpga's intel 'logic analyser'.) > > > It is a very simple 4-clock pipeline cpu with a 2-clock delay > > > before a value read from 'tightly coupled memory' (aka cache) > > > can be used in another instruction.
> > > There is also a subtle pipeline stall if a read follows a write > > > to the same memory block because the write is executed one > > > clock later - and would collide with the read. > > > Since it only ever executes one instruction per clock, loop > > > unrolling does help - since you never get the loop control 'for free'. > > > OTOH you don't need to use that many registers. > > > But an unrolled loop should approach 2 bytes/clock (32bit cpu). > > > > > > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar > > > > result to the other versions: > > > > > > > > [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s > > > > > > What clock speed is that running at? > > > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu). > > > > > > While the small riscv cpu might be similar to the nios2 (and mips > > > for that matter), there are also bigger/faster cpu. > > > I'm sure these can execute multiple instructions/clock > > > and possibly even read and write at the same time. > > > Unless they also support significant instruction re-ordering > > > the trivial copy loops are going to be slow on such cpu. > > > > > > > It's running at 1 GHz. > > > > I get 257 Mb/s with a memcpy, a bit more with a memset, > > but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing. > > > > Err, I forgot an mlock() before accessing the memory in userspace. > > The real speed here is: > > 8 bit read: 155.42 Mb/s > 64 bit read: 277.29 Mb/s > 8 bit write: 138.57 Mb/s > 64 bit write: 239.21 Mb/s > Anyway, thanks for the info on nios2 timings. If you think that an unrolled loop would help, we can achieve the same in C. I think we could code something similar to a Duff's device (or with jump labels) to unroll the loop but at the same time doing efficient small copies. Regards, -- per aspera ad upstream ^ permalink raw reply [flat|nested] 64+ messages in thread
* RE: [PATCH 1/3] riscv: optimized memcpy 2021-06-18 1:05 ` Matteo Croce @ 2021-06-18 8:32 ` David Laight 1 sibling, 0 replies; 64+ messages in thread From: David Laight @ 2021-06-18 8:32 UTC (permalink / raw) To: 'Matteo Croce' Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng From: Matteo Croce > Sent: 18 June 2021 02:05 ... > > > It's running at 1 GHz. > > > > > > I get 257 Mb/s with a memcpy, a bit more with a memset, > > > but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing. > > > > > > > Err, I forgot an mlock() before accessing the memory in userspace. What is the mlock() for? The data for a quick loop won't get paged out. You want to test cache to cache copies, so the first loop will always be slow. After that each iteration should be much the same. I use code like: for (;;) { start = read_tsc(); do_test(); histogram[(read_tsc() - start) >> n]++; } (You need to exclude outliers) to get a distribution for the execution times. Tends to be pretty stable - even though different program runs can give different values! > > The real speed here is: > > > > 8 bit read: 155.42 Mb/s > > 64 bit read: 277.29 Mb/s > > 8 bit write: 138.57 Mb/s > > 64 bit write: 239.21 Mb/s > > > > Anyway, thanks for the info on nios2 timings. > If you think that an unrolled loop would help, we can achieve the same in C. > I think we could code something similar to a Duff's device (or with jump > labels) to unroll the loop but at the same time doing efficient small copies. Unrolling has to be done with care. It tends to improve benchmarks, but the extra code displaces other code from the i-cache and slows down overall performance. So you need 'just enough' unrolling to avoid cpu stalls. On your system it looks like the memory/cache subsystem is the bottleneck for the tests you are doing.
I'd really expect a 1GHz cpu to be able to read/write from its data cache every clock. So I'd expect transfer rates nearer 8000 MB/s, not 250 MB/s. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH 2/3] riscv: optimized memmove 2021-06-15 2:38 ` Matteo Croce @ 2021-06-15 2:38 ` Matteo Croce 1 sibling, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-15 2:38 UTC (permalink / raw) To: linux-riscv Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng From: Matteo Croce <mcroce@microsoft.com> When the destination buffer is before the source one, or when the buffers don't overlap, it's safe to use memcpy() instead, which is optimized to use the biggest data size possible. Signed-off-by: Matteo Croce <mcroce@microsoft.com> --- arch/riscv/include/asm/string.h | 6 ++-- arch/riscv/kernel/riscv_ksyms.c | 2 -- arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memmove.S | 64 --------------------------------- arch/riscv/lib/string.c | 26 ++++++++++++++ 5 files changed, 29 insertions(+), 70 deletions(-) delete mode 100644 arch/riscv/lib/memmove.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index 6b5d6fc3eab4..25d9b9078569 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -17,11 +17,11 @@ extern asmlinkage void *__memset(void *, int, size_t); #define __HAVE_ARCH_MEMCPY extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); +#define __HAVE_ARCH_MEMMOVE +extern void *memmove(void *dest, const void *src, size_t count); +extern void *__memmove(void *dest, const void *src, size_t count); #endif -#define __HAVE_ARCH_MEMMOVE -extern asmlinkage void *memmove(void *, const void *, size_t); -extern asmlinkage void *__memmove(void *, const void *, size_t); /* For those files which don't want to check by kasan.
*/ #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__) #define memcpy(dst, src, len) __memcpy(dst, src, len) diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index 3f6d512a5b97..361565c4db7e 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -10,6 +10,4 @@ * Assembly functions that may be used (directly or indirectly) by modules */ EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(memmove); EXPORT_SYMBOL(__memset); -EXPORT_SYMBOL(__memmove); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 2ffe85d4baee..484f5ff7b508 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o lib-y += memset.o -lib-y += memmove.o lib-$(CONFIG_MMU) += uaccess.o lib-$(CONFIG_64BIT) += tishift.o lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S deleted file mode 100644 index 07d1d2152ba5..000000000000 --- a/arch/riscv/lib/memmove.S +++ /dev/null @@ -1,64 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ - -#include <linux/linkage.h> -#include <asm/asm.h> - -ENTRY(__memmove) -WEAK(memmove) - move t0, a0 - move t1, a1 - - beq a0, a1, exit_memcpy - beqz a2, exit_memcpy - srli t2, a2, 0x2 - - slt t3, a0, a1 - beqz t3, do_reverse - - andi a2, a2, 0x3 - li t4, 1 - beqz t2, byte_copy - -word_copy: - lw t3, 0(a1) - addi t2, t2, -1 - addi a1, a1, 4 - sw t3, 0(a0) - addi a0, a0, 4 - bnez t2, word_copy - beqz a2, exit_memcpy - j byte_copy - -do_reverse: - add a0, a0, a2 - add a1, a1, a2 - andi a2, a2, 0x3 - li t4, -1 - beqz t2, reverse_byte_copy - -reverse_word_copy: - addi a1, a1, -4 - addi t2, t2, -1 - lw t3, 0(a1) - addi a0, a0, -4 - sw t3, 0(a0) - bnez t2, reverse_word_copy - beqz a2, exit_memcpy - -reverse_byte_copy: - addi a0, a0, -1 - addi a1, a1, -1 - -byte_copy: - lb t3, 0(a1) - addi a2, a2, -1 - sb t3, 0(a0) - add a1, a1, t4 - add a0, a0, t4 - bnez a2, byte_copy 
- -exit_memcpy: - move a0, t0 - move a1, t1 - ret -END(__memmove) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index 525f9ee25a74..bc006708f075 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -92,3 +92,29 @@ void *__memcpy(void *dest, const void *src, size_t count) return memcpy(dest, src, count); } EXPORT_SYMBOL(__memcpy); + +/* + * Simply check if the buffers overlap and call memcpy() in that case, + * otherwise do a simple one byte at a time backward copy. + */ +void *memmove(void *dest, const void *src, size_t count) +{ + if (dest < src || src + count <= dest) + return memcpy(dest, src, count); + + if (dest > src) { + const char *s = src + count; + char *tmp = dest + count; + + while (count--) + *--tmp = *--s; + } + return dest; +} +EXPORT_SYMBOL(memmove); + +void *__memmove(void *dest, const void *src, size_t count) +{ + return memmove(dest, src, count); +} +EXPORT_SYMBOL(__memmove); -- 2.31.1 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 3/3] riscv: optimized memset 2021-06-15 2:38 ` Matteo Croce @ 2021-06-15 2:38 ` Matteo Croce 1 sibling, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-15 2:38 UTC (permalink / raw) To: linux-riscv Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng From: Matteo Croce <mcroce@microsoft.com> The generic memset is defined as a byte-at-a-time write. This is always safe, but it's slower than a 4 byte or even 8 byte write. Write a generic memset which fills the data one byte at a time until the destination is aligned, then fills using the largest size allowed, and finally fills the remaining data one byte at a time. Signed-off-by: Matteo Croce <mcroce@microsoft.com> --- arch/riscv/include/asm/string.h | 10 +-- arch/riscv/kernel/Makefile | 1 - arch/riscv/kernel/riscv_ksyms.c | 13 ---- arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memset.S | 113 -------------------------------- arch/riscv/lib/string.c | 42 ++++++++++++ 6 files changed, 45 insertions(+), 135 deletions(-) delete mode 100644 arch/riscv/kernel/riscv_ksyms.c delete mode 100644 arch/riscv/lib/memset.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index 25d9b9078569..90500635035a 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -6,14 +6,10 @@ #ifndef _ASM_RISCV_STRING_H #define _ASM_RISCV_STRING_H -#include <linux/types.h> -#include <linux/linkage.h> - -#define __HAVE_ARCH_MEMSET -extern asmlinkage void *memset(void *, int, size_t); -extern asmlinkage void *__memset(void *, int, size_t); - #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE +#define __HAVE_ARCH_MEMSET +extern void *memset(void *s, int c, size_t count); +extern void *__memset(void *s, int c, size_t count); #define __HAVE_ARCH_MEMCPY extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); diff --git
a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index d3081e4d9600..e635ce1e5645 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -31,7 +31,6 @@ obj-y += syscall_table.o obj-y += sys_riscv.o obj-y += time.o obj-y += traps.o -obj-y += riscv_ksyms.o obj-y += stacktrace.o obj-y += cacheinfo.o obj-y += patch.o diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c deleted file mode 100644 index 361565c4db7e..000000000000 --- a/arch/riscv/kernel/riscv_ksyms.c +++ /dev/null @@ -1,13 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * Copyright (C) 2017 Zihao Yu - */ - -#include <linux/export.h> -#include <linux/uaccess.h> - -/* - * Assembly functions that may be used (directly or indirectly) by modules - */ -EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(__memset); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 484f5ff7b508..e33263cc622a 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,6 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o -lib-y += memset.o lib-$(CONFIG_MMU) += uaccess.o lib-$(CONFIG_64BIT) += tishift.o lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S deleted file mode 100644 index 34c5360c6705..000000000000 --- a/arch/riscv/lib/memset.S +++ /dev/null @@ -1,113 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2013 Regents of the University of California - */ - - -#include <linux/linkage.h> -#include <asm/asm.h> - -/* void *memset(void *, int, size_t) */ -ENTRY(__memset) -WEAK(memset) - move t0, a0 /* Preserve return value */ - - /* Defer to byte-oriented fill for small sizes */ - sltiu a3, a2, 16 - bnez a3, 4f - - /* - * Round to nearest XLEN-aligned address - * greater than or equal to start address - */ - addi a3, t0, SZREG-1 - andi a3, a3, ~(SZREG-1) - beq a3, t0, 2f /* Skip if already aligned */ - /* Handle initial misalignment */ - sub a4, 
a3, t0 -1: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 1b - sub a2, a2, a4 /* Update count */ - -2: /* Duff's device with 32 XLEN stores per iteration */ - /* Broadcast value into all bytes */ - andi a1, a1, 0xff - slli a3, a1, 8 - or a1, a3, a1 - slli a3, a1, 16 - or a1, a3, a1 -#ifdef CONFIG_64BIT - slli a3, a1, 32 - or a1, a3, a1 -#endif - - /* Calculate end address */ - andi a4, a2, ~(SZREG-1) - add a3, t0, a4 - - andi a4, a4, 31*SZREG /* Calculate remainder */ - beqz a4, 3f /* Shortcut if no remainder */ - neg a4, a4 - addi a4, a4, 32*SZREG /* Calculate initial offset */ - - /* Adjust start address with offset */ - sub t0, t0, a4 - - /* Jump into loop body */ - /* Assumes 32-bit instruction lengths */ - la a5, 3f -#ifdef CONFIG_64BIT - srli a4, a4, 1 -#endif - add a5, a5, a4 - jr a5 -3: - REG_S a1, 0(t0) - REG_S a1, SZREG(t0) - REG_S a1, 2*SZREG(t0) - REG_S a1, 3*SZREG(t0) - REG_S a1, 4*SZREG(t0) - REG_S a1, 5*SZREG(t0) - REG_S a1, 6*SZREG(t0) - REG_S a1, 7*SZREG(t0) - REG_S a1, 8*SZREG(t0) - REG_S a1, 9*SZREG(t0) - REG_S a1, 10*SZREG(t0) - REG_S a1, 11*SZREG(t0) - REG_S a1, 12*SZREG(t0) - REG_S a1, 13*SZREG(t0) - REG_S a1, 14*SZREG(t0) - REG_S a1, 15*SZREG(t0) - REG_S a1, 16*SZREG(t0) - REG_S a1, 17*SZREG(t0) - REG_S a1, 18*SZREG(t0) - REG_S a1, 19*SZREG(t0) - REG_S a1, 20*SZREG(t0) - REG_S a1, 21*SZREG(t0) - REG_S a1, 22*SZREG(t0) - REG_S a1, 23*SZREG(t0) - REG_S a1, 24*SZREG(t0) - REG_S a1, 25*SZREG(t0) - REG_S a1, 26*SZREG(t0) - REG_S a1, 27*SZREG(t0) - REG_S a1, 28*SZREG(t0) - REG_S a1, 29*SZREG(t0) - REG_S a1, 30*SZREG(t0) - REG_S a1, 31*SZREG(t0) - addi t0, t0, 32*SZREG - bltu t0, a3, 3b - andi a2, a2, SZREG-1 /* Update count */ - -4: - /* Handle trailing misalignment */ - beqz a2, 6f - add a3, t0, a2 -5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b -6: - ret -END(__memset) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index bc006708f075..62869627e139 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -118,3 
+118,45 @@ void *__memmove(void *dest, const void *src, size_t count) return memmove(dest, src, count); } EXPORT_SYMBOL(__memmove); + +void *memset(void *s, int c, size_t count) +{ + union types dest = { .u8 = s }; + + if (count > MIN_THRESHOLD) { + const int bytes_long = BITS_PER_LONG / 8; + unsigned long cu = (unsigned long)c; + + /* Compose a ulong with 'c' repeated 4/8 times */ + cu = +#if BITS_PER_LONG == 64 + cu << 56 | cu << 48 | cu << 40 | cu << 32 | +#endif + cu << 24 | cu << 16 | cu << 8 | cu; + +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + /* Fill the buffer one byte at a time until the destination + * is aligned on a 32/64 bit boundary. + */ + for (; count && dest.uptr % bytes_long; count--) + *dest.u8++ = c; +#endif + + /* Copy using the largest size allowed */ + for (; count >= bytes_long; count -= bytes_long) + *dest.ulong++ = cu; + } + + /* Copy the remainder */ + while (count--) + *dest.u8++ = c; + + return s; +} +EXPORT_SYMBOL(memset); + +void *__memset(void *s, int c, size_t count) +{ + return memset(s, c, count); +} +EXPORT_SYMBOL(__memset); -- 2.31.1 ^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 3/3] riscv: optimized memset @ 2021-06-15 2:38 ` Matteo Croce 0 siblings, 0 replies; 64+ messages in thread From: Matteo Croce @ 2021-06-15 2:38 UTC (permalink / raw) To: linux-riscv Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng From: Matteo Croce <mcroce@microsoft.com> The generic memset is defined as a byte-at-a-time write. This is always safe, but it's slower than a 4 byte or even 8 byte write. Write a generic memset which fills the data one byte at a time until the destination is aligned, then fills using the largest size allowed, and finally fills the remaining data one byte at a time. Signed-off-by: Matteo Croce <mcroce@microsoft.com> --- arch/riscv/include/asm/string.h | 10 +-- arch/riscv/kernel/Makefile | 1 - arch/riscv/kernel/riscv_ksyms.c | 13 ---- arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memset.S | 113 -------------------------------- arch/riscv/lib/string.c | 42 ++++++++++++ 6 files changed, 45 insertions(+), 135 deletions(-) delete mode 100644 arch/riscv/kernel/riscv_ksyms.c delete mode 100644 arch/riscv/lib/memset.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index 25d9b9078569..90500635035a 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -6,14 +6,10 @@ #ifndef _ASM_RISCV_STRING_H #define _ASM_RISCV_STRING_H -#include <linux/types.h> -#include <linux/linkage.h> - -#define __HAVE_ARCH_MEMSET -extern asmlinkage void *memset(void *, int, size_t); -extern asmlinkage void *__memset(void *, int, size_t); - #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE +#define __HAVE_ARCH_MEMSET +extern void *memset(void *s, int c, size_t count); +extern void *__memset(void *s, int c, size_t count); #define __HAVE_ARCH_MEMCPY extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); diff --git a/arch/riscv/kernel/Makefile
b/arch/riscv/kernel/Makefile index d3081e4d9600..e635ce1e5645 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -31,7 +31,6 @@ obj-y += syscall_table.o obj-y += sys_riscv.o obj-y += time.o obj-y += traps.o -obj-y += riscv_ksyms.o obj-y += stacktrace.o obj-y += cacheinfo.o obj-y += patch.o diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c deleted file mode 100644 index 361565c4db7e..000000000000 --- a/arch/riscv/kernel/riscv_ksyms.c +++ /dev/null @@ -1,13 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * Copyright (C) 2017 Zihao Yu - */ - -#include <linux/export.h> -#include <linux/uaccess.h> - -/* - * Assembly functions that may be used (directly or indirectly) by modules - */ -EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(__memset); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 484f5ff7b508..e33263cc622a 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,6 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o -lib-y += memset.o lib-$(CONFIG_MMU) += uaccess.o lib-$(CONFIG_64BIT) += tishift.o lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S deleted file mode 100644 index 34c5360c6705..000000000000 --- a/arch/riscv/lib/memset.S +++ /dev/null @@ -1,113 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2013 Regents of the University of California - */ - - -#include <linux/linkage.h> -#include <asm/asm.h> - -/* void *memset(void *, int, size_t) */ -ENTRY(__memset) -WEAK(memset) - move t0, a0 /* Preserve return value */ - - /* Defer to byte-oriented fill for small sizes */ - sltiu a3, a2, 16 - bnez a3, 4f - - /* - * Round to nearest XLEN-aligned address - * greater than or equal to start address - */ - addi a3, t0, SZREG-1 - andi a3, a3, ~(SZREG-1) - beq a3, t0, 2f /* Skip if already aligned */ - /* Handle initial misalignment */ - sub a4, a3, t0 -1: - sb a1, 0(t0) - 
addi t0, t0, 1 - bltu t0, a3, 1b - sub a2, a2, a4 /* Update count */ - -2: /* Duff's device with 32 XLEN stores per iteration */ - /* Broadcast value into all bytes */ - andi a1, a1, 0xff - slli a3, a1, 8 - or a1, a3, a1 - slli a3, a1, 16 - or a1, a3, a1 -#ifdef CONFIG_64BIT - slli a3, a1, 32 - or a1, a3, a1 -#endif - - /* Calculate end address */ - andi a4, a2, ~(SZREG-1) - add a3, t0, a4 - - andi a4, a4, 31*SZREG /* Calculate remainder */ - beqz a4, 3f /* Shortcut if no remainder */ - neg a4, a4 - addi a4, a4, 32*SZREG /* Calculate initial offset */ - - /* Adjust start address with offset */ - sub t0, t0, a4 - - /* Jump into loop body */ - /* Assumes 32-bit instruction lengths */ - la a5, 3f -#ifdef CONFIG_64BIT - srli a4, a4, 1 -#endif - add a5, a5, a4 - jr a5 -3: - REG_S a1, 0(t0) - REG_S a1, SZREG(t0) - REG_S a1, 2*SZREG(t0) - REG_S a1, 3*SZREG(t0) - REG_S a1, 4*SZREG(t0) - REG_S a1, 5*SZREG(t0) - REG_S a1, 6*SZREG(t0) - REG_S a1, 7*SZREG(t0) - REG_S a1, 8*SZREG(t0) - REG_S a1, 9*SZREG(t0) - REG_S a1, 10*SZREG(t0) - REG_S a1, 11*SZREG(t0) - REG_S a1, 12*SZREG(t0) - REG_S a1, 13*SZREG(t0) - REG_S a1, 14*SZREG(t0) - REG_S a1, 15*SZREG(t0) - REG_S a1, 16*SZREG(t0) - REG_S a1, 17*SZREG(t0) - REG_S a1, 18*SZREG(t0) - REG_S a1, 19*SZREG(t0) - REG_S a1, 20*SZREG(t0) - REG_S a1, 21*SZREG(t0) - REG_S a1, 22*SZREG(t0) - REG_S a1, 23*SZREG(t0) - REG_S a1, 24*SZREG(t0) - REG_S a1, 25*SZREG(t0) - REG_S a1, 26*SZREG(t0) - REG_S a1, 27*SZREG(t0) - REG_S a1, 28*SZREG(t0) - REG_S a1, 29*SZREG(t0) - REG_S a1, 30*SZREG(t0) - REG_S a1, 31*SZREG(t0) - addi t0, t0, 32*SZREG - bltu t0, a3, 3b - andi a2, a2, SZREG-1 /* Update count */ - -4: - /* Handle trailing misalignment */ - beqz a2, 6f - add a3, t0, a2 -5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b -6: - ret -END(__memset) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index bc006708f075..62869627e139 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -118,3 +118,45 @@ void 
*__memmove(void *dest, const void *src, size_t count) return memmove(dest, src, count); } EXPORT_SYMBOL(__memmove); + +void *memset(void *s, int c, size_t count) +{ + union types dest = { .u8 = s }; + + if (count > MIN_THRESHOLD) { + const int bytes_long = BITS_PER_LONG / 8; + unsigned long cu = (unsigned long)c; + + /* Compose an ulong with 'c' repeated 4/8 times */ + cu = +#if BITS_PER_LONG == 64 + cu << 56 | cu << 48 | cu << 40 | cu << 32 | +#endif + cu << 24 | cu << 16 | cu << 8 | cu; + +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + /* Fill the buffer one byte at time until the destination + * is aligned on a 32/64 bit boundary. + */ + for (; count && dest.uptr % bytes_long; count--) + *dest.u8++ = c; +#endif + + /* Copy using the largest size allowed */ + for (; count >= bytes_long; count -= bytes_long) + *dest.ulong++ = cu; + } + + /* copy the remainder */ + while (count--) + *dest.u8++ = c; + + return s; +} +EXPORT_SYMBOL(memset); + +void *__memset(void *s, int c, size_t count) +{ + return memset(s, c, count); +} +EXPORT_SYMBOL(__memset); -- 2.31.1 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH 0/3] riscv: optimized mem* functions 2021-06-15 2:38 ` Matteo Croce @ 2021-06-15 2:43 ` Bin Meng -1 siblings, 0 replies; 64+ messages in thread From: Bin Meng @ 2021-06-15 2:43 UTC (permalink / raw) To: Matteo Croce Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini Hi Matteo, On Tue, Jun 15, 2021 at 10:39 AM Matteo Croce <mcroce@linux.microsoft.com> wrote: > > From: Matteo Croce <mcroce@microsoft.com> > > Replace the assembly mem{cpy,move,set} with C equivalent. > > Try to access RAM with the largest bit width possible, but without > doing unaligned accesses. > > Tested on a BeagleV Starlight with a SiFive U74 core, where the > improvement is noticeable. > There is already a patch on the ML for optimizing the assembly version. https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/ Would you please try that and compare the results? Regards, Bin ^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH 0/3] riscv: optimize memcpy/memmove/memset @ 2024-01-28 11:10 Jisheng Zhang 2024-01-28 11:10 ` Jisheng Zhang 0 siblings, 1 reply; 64+ messages in thread From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw) To: Paul Walmsley, Palmer Dabbelt, Albert Ou; +Cc: linux-riscv, linux-kernel This series is to renew Matteo's "riscv: optimized mem* functions" series. Compared with Matteo's original series, Jisheng made the following changes: 1. adopt Emil's change to fix a boot failure when built with clang 2. add corresponding changes to purgatory 3. always build the optimized string.c rather than only building it when optimizing for performance 4. implement unroll support when src & dst are both aligned to keep the same performance as the assembly version. After disassembling, I found that the unroll version looks something like below, so it achieves the "unroll" effect of the asm version, but in the C programming language: ld t2,0(a5) ld t0,8(a5) ld t6,16(a5) ld t5,24(a5) ld t4,32(a5) ld t3,40(a5) ld t1,48(a5) ld a1,56(a5) sd t2,0(a6) sd t0,8(a6) sd t6,16(a6) sd t5,24(a6) sd t4,32(a6) sd t3,40(a6) sd t1,48(a6) sd a1,56(a6) And per my testing, unrolling more doesn't help performance, so the "c" version only unrolls by using 8 GP regs rather than 16 as the asm version does. 5. Add proper __pi_memcpy and __pi___memcpy aliases 6. more performance numbers. Per my benchmark with [1] on TH1520, CV1800B and JH7110 platforms, the unaligned medium memcpy runs at about 3.5x ~ 8.6x the speed of the unpatched version's! Check patch1 for more details and performance numbers. Link: https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1] Here is the original cover letter msg from Matteo: Replace the assembly mem{cpy,move,set} with C equivalents. Try to access RAM with the largest bit width possible, but without doing unaligned accesses. A further improvement could be to use multiple reads and writes as the assembly version was trying to do.
Tested on a BeagleV Starlight with a SiFive U74 core, where the improvement is noticeable. Matteo Croce (3): riscv: optimized memcpy riscv: optimized memmove riscv: optimized memset arch/riscv/include/asm/string.h | 14 +- arch/riscv/kernel/riscv_ksyms.c | 6 - arch/riscv/lib/Makefile | 9 +- arch/riscv/lib/memcpy.S | 110 ----------- arch/riscv/lib/memmove.S | 317 -------------------------------- arch/riscv/lib/memset.S | 113 ------------ arch/riscv/lib/string.c | 187 +++++++++++++++++++ arch/riscv/purgatory/Makefile | 13 +- 8 files changed, 206 insertions(+), 563 deletions(-) delete mode 100644 arch/riscv/lib/memcpy.S delete mode 100644 arch/riscv/lib/memmove.S delete mode 100644 arch/riscv/lib/memset.S create mode 100644 arch/riscv/lib/string.c -- 2.43.0 ^ permalink raw reply [flat|nested] 64+ messages in thread
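Point 4 of the changelog above — unrolling the co-aligned path in C so the compiler batches eight independent loads before the eight stores, as in the quoted disassembly — can be sketched roughly as follows. This is a hypothetical user-space demo, not the patch code; the function name and the caller-handles-alignment assumption are mine:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the "unroll by 8 words" idea: when src and
 * dst are both word-aligned, copy eight unsigned longs per iteration.
 * The eight local temporaries let the compiler schedule all eight
 * loads before the stores, mirroring the ld/sd groups above. */
static void copy_aligned_unrolled(unsigned long *d, const unsigned long *s,
				  size_t nwords)
{
	for (; nwords >= 8; nwords -= 8, d += 8, s += 8) {
		unsigned long w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
		unsigned long w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];

		d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
		d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
	}
	/* Tail: fewer than 8 words left */
	while (nwords--)
		*d++ = *s++;
}
```

Using 8 temporaries rather than 16 matches the cover letter's observation that wider unrolling did not help on the tested cores.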
* [PATCH 2/3] riscv: optimized memmove 2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang @ 2024-01-28 11:10 ` Jisheng Zhang 0 siblings, 0 replies; 64+ messages in thread From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw) To: Paul Walmsley, Palmer Dabbelt, Albert Ou Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot From: Matteo Croce <mcroce@microsoft.com> When the destination buffer is before the source one, or when the buffers don't overlap, it's safe to use memcpy() instead, which is optimized to use the biggest data size possible. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jisheng Zhang <jszhang@kernel.org> --- arch/riscv/include/asm/string.h | 4 +- arch/riscv/kernel/riscv_ksyms.c | 2 - arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memmove.S | 317 -------------------------------- arch/riscv/lib/string.c | 25 +++ 5 files changed, 27 insertions(+), 322 deletions(-) delete mode 100644 arch/riscv/lib/memmove.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index edf1d56e4f13..17c3b40382e1 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -18,8 +18,8 @@ extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); #define __HAVE_ARCH_MEMMOVE -extern asmlinkage void *memmove(void *, const void *, size_t); -extern asmlinkage void *__memmove(void *, const void *, size_t); +extern void *memmove(void *dest, const void *src, size_t count); +extern void *__memmove(void *dest, const void *src, size_t count); #define __HAVE_ARCH_STRCMP extern asmlinkage int strcmp(const char *cs, const char *ct); diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index c69dc74e0a27..76849d0906ef 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -10,9 +10,7 @@ * Assembly
functions that may be used (directly or indirectly) by modules */ EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(memmove); EXPORT_SYMBOL(strcmp); EXPORT_SYMBOL(strlen); EXPORT_SYMBOL(strncmp); EXPORT_SYMBOL(__memset); -EXPORT_SYMBOL(__memmove); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 5f2f94f6db17..5fa88c5a601c 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o lib-y += memset.o -lib-y += memmove.o lib-y += strcmp.o lib-y += strlen.o lib-y += string.o diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S deleted file mode 100644 index cb3e2e7ef0ba..000000000000 --- a/arch/riscv/lib/memmove.S +++ /dev/null @@ -1,317 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com> - */ - -#include <linux/linkage.h> -#include <asm/asm.h> - -SYM_FUNC_START(__memmove) - /* - * Returns - * a0 - dest - * - * Parameters - * a0 - Inclusive first byte of dest - * a1 - Inclusive first byte of src - * a2 - Length of copy n - * - * Because the return matches the parameter register a0, - * we will not clobber or modify that register. - * - * Note: This currently only works on little-endian. - * To port to big-endian, reverse the direction of shifts - * in the 2 misaligned fixup copy loops. 
- */ - - /* Return if nothing to do */ - beq a0, a1, .Lreturn_from_memmove - beqz a2, .Lreturn_from_memmove - - /* - * Register Uses - * Forward Copy: a1 - Index counter of src - * Reverse Copy: a4 - Index counter of src - * Forward Copy: t3 - Index counter of dest - * Reverse Copy: t4 - Index counter of dest - * Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest - * Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest - * Both Copy Modes: t0 - Link / Temporary for load-store - * Both Copy Modes: t1 - Temporary for load-store - * Both Copy Modes: t2 - Temporary for load-store - * Both Copy Modes: a5 - dest to src alignment offset - * Both Copy Modes: a6 - Shift ammount - * Both Copy Modes: a7 - Inverse Shift ammount - * Both Copy Modes: a2 - Alternate breakpoint for unrolled loops - */ - - /* - * Solve for some register values now. - * Byte copy does not need t5 or t6. - */ - mv t3, a0 - add t4, a0, a2 - add a4, a1, a2 - - /* - * Byte copy if copying less than (2 * SZREG) bytes. This can - * cause problems with the bulk copy implementation and is - * small enough not to bother. - */ - andi t0, a2, -(2 * SZREG) - beqz t0, .Lbyte_copy - - /* - * Now solve for t5 and t6. - */ - andi t5, t3, -SZREG - andi t6, t4, -SZREG - /* - * If dest(Register t3) rounded down to the nearest naturally - * aligned SZREG address, does not equal dest, then add SZREG - * to find the low-bound of SZREG alignment in the dest memory - * region. Note that this could overshoot the dest memory - * region if n is less than SZREG. This is one reason why - * we always byte copy if n is less than SZREG. - * Otherwise, dest is already naturally aligned to SZREG. - */ - beq t5, t3, 1f - addi t5, t5, SZREG - 1: - - /* - * If the dest and src are co-aligned to SZREG, then there is - * no need for the full rigmarole of a full misaligned fixup copy. - * Instead, do a simpler co-aligned copy. 
- */ - xor t0, a0, a1 - andi t1, t0, (SZREG - 1) - beqz t1, .Lcoaligned_copy - /* Fall through to misaligned fixup copy */ - -.Lmisaligned_fixup_copy: - bltu a1, a0, .Lmisaligned_fixup_copy_reverse - -.Lmisaligned_fixup_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */ - slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */ - sub a5, a1, t3 /* Find the difference between src and dest */ - andi a1, a1, -SZREG /* Align the src pointer */ - addi a2, t6, SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Forward - * load_val0 = load_ptr[0]; - * do { - * load_val1 = load_ptr[1]; - * store_ptr += 2; - * store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val0 = load_ptr[2]; - * load_ptr += 2; - * store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t0, (0 * SZREG)(a1) - 1: - REG_L t1, (1 * SZREG)(a1) - addi t3, t3, (2 * SZREG) - srl t0, t0, a6 - sll t2, t1, a7 - or t2, t0, t2 - REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3) - - beq t3, a2, 2f - - REG_L t0, (2 * SZREG)(a1) - addi a1, a1, (2 * SZREG) - srl t1, t1, a6 - sll t2, t0, a7 - or t2, t1, t2 - REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3) - - bne t3, t6, 1b - 2: - mv t3, t6 /* Fix the dest pointer in case the loop was broken */ - - add a1, t3, a5 /* Restore the src pointer */ - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lmisaligned_fixup_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */ - slli a6, a5, 3 /* Multiply 
by 8 to convert that to bits to shift */ - sub a5, a4, t4 /* Find the difference between src and dest */ - andi a4, a4, -SZREG /* Align the src pointer */ - addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Reverse - * load_val1 = load_ptr[0]; - * do { - * load_val0 = load_ptr[-1]; - * store_ptr -= 2; - * store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val1 = load_ptr[-2]; - * load_ptr -= 2; - * store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t1, ( 0 * SZREG)(a4) - 1: - REG_L t0, (-1 * SZREG)(a4) - addi t4, t4, (-2 * SZREG) - sll t1, t1, a7 - srl t2, t0, a6 - or t2, t1, t2 - REG_S t2, ( 1 * SZREG)(t4) - - beq t4, a2, 2f - - REG_L t1, (-2 * SZREG)(a4) - addi a4, a4, (-2 * SZREG) - sll t0, t0, a7 - srl t2, t1, a6 - or t2, t0, t2 - REG_S t2, ( 0 * SZREG)(t4) - - bne t4, t5, 1b - 2: - mv t4, t5 /* Fix the dest pointer in case the loop was broken */ - - add a4, t4, a5 /* Restore the src pointer */ - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * Simple copy loops for SZREG co-aligned memory locations. - * These also make calls to do byte copies for any unaligned - * data at their terminations. 
- */ -.Lcoaligned_copy: - bltu a1, a0, .Lcoaligned_copy_reverse - -.Lcoaligned_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - 1: - REG_L t1, ( 0 * SZREG)(a1) - addi a1, a1, SZREG - addi t3, t3, SZREG - REG_S t1, (-1 * SZREG)(t3) - bne t3, t6, 1b - - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lcoaligned_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - 1: - REG_L t1, (-1 * SZREG)(a4) - addi a4, a4, -SZREG - addi t4, t4, -SZREG - REG_S t1, ( 0 * SZREG)(t4) - bne t4, t5, 1b - - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * These are basically sub-functions within the function. They - * are used to byte copy until the dest pointer is in alignment. - * At which point, a bulk copy method can be used by the - * calling code. These work on the same registers as the bulk - * copy loops. Therefore, the register values can be picked - * up from where they were left and we avoid code duplication - * without any overhead except the call in and return jumps. - */ -.Lbyte_copy_until_aligned_forward: - beq t3, t5, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t5, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -.Lbyte_copy_until_aligned_reverse: - beq t4, t6, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t6, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -/* - * Simple byte copy loops. - * These will byte copy until they reach the end of data to copy. - * At that point, they will call to return from memmove. 
- */ -.Lbyte_copy: - bltu a1, a0, .Lbyte_copy_reverse - -.Lbyte_copy_forward: - beq t3, t4, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t4, 1b - 2: - ret - -.Lbyte_copy_reverse: - beq t4, t3, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t3, 1b - 2: - -.Lreturn_from_memmove: - ret - -SYM_FUNC_END(__memmove) -SYM_FUNC_ALIAS_WEAK(memmove, __memmove) -SYM_FUNC_ALIAS(__pi_memmove, __memmove) -SYM_FUNC_ALIAS(__pi___memmove, __memmove) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index 5f9c83ec548d..20677c8067da 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy) EXPORT_SYMBOL(memcpy); void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); + +/* + * Simply check whether the buffers overlap and call memcpy() in that case; + * otherwise do a simple one-byte-at-a-time backward copy. + */ +void *__memmove(void *dest, const void *src, size_t count) +{ + if (dest < src || src + count <= dest) + return __memcpy(dest, src, count); + + if (dest > src) { + const char *s = src + count; + char *tmp = dest + count; + + while (count--) + *--tmp = *--s; + } + return dest; +} +EXPORT_SYMBOL(__memmove); + +void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove); +EXPORT_SYMBOL(memmove); +void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove); +void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove); -- 2.43.0 ^ permalink raw reply related [flat|nested] 64+ messages in thread
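The dispatch logic of the new C __memmove() can be modeled in user space like this. This is a hypothetical stand-alone version (my naming), with plain memcpy() standing in for the kernel's __memcpy():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal model of the patch's dispatch rule: a forward copy (dest
 * below src, or no overlap at all) can safely use memcpy(); only a
 * backward-overlapping move needs the byte-at-a-time reverse loop. */
static void *my_memmove(void *dest, const void *src, size_t count)
{
	const char *s = src;
	char *d = dest;

	if (d < s || s + count <= d)
		return memcpy(dest, src, count);	/* safe forward copy */

	/* dest overlaps the tail of src: copy backwards */
	s += count;
	d += count;
	while (count--)
		*--d = *--s;
	return dest;
}
```

Note the backward path is the slow byte loop; the series trades that for simplicity, since in practice most memmove() calls either don't overlap or have dest below src and therefore take the fast memcpy() path.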
* [PATCH 2/3] riscv: optimized memmove @ 2024-01-28 11:10 ` Jisheng Zhang 0 siblings, 0 replies; 64+ messages in thread From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw) To: Paul Walmsley, Palmer Dabbelt, Albert Ou Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot From: Matteo Croce <mcroce@microsoft.com> When the destination buffer is before the source one, or when the buffers doesn't overlap, it's safe to use memcpy() instead, which is optimized to use a bigger data size possible. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jisheng Zhang <jszhang@kernel.org> --- arch/riscv/include/asm/string.h | 4 +- arch/riscv/kernel/riscv_ksyms.c | 2 - arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memmove.S | 317 -------------------------------- arch/riscv/lib/string.c | 25 +++ 5 files changed, 27 insertions(+), 322 deletions(-) delete mode 100644 arch/riscv/lib/memmove.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index edf1d56e4f13..17c3b40382e1 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -18,8 +18,8 @@ extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); #define __HAVE_ARCH_MEMMOVE -extern asmlinkage void *memmove(void *, const void *, size_t); -extern asmlinkage void *__memmove(void *, const void *, size_t); +extern void *memmove(void *dest, const void *src, size_t count); +extern void *__memmove(void *dest, const void *src, size_t count); #define __HAVE_ARCH_STRCMP extern asmlinkage int strcmp(const char *cs, const char *ct); diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index c69dc74e0a27..76849d0906ef 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -10,9 +10,7 @@ * Assembly functions that may be used (directly or indirectly) by modules */ 
EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(memmove); EXPORT_SYMBOL(strcmp); EXPORT_SYMBOL(strlen); EXPORT_SYMBOL(strncmp); EXPORT_SYMBOL(__memset); -EXPORT_SYMBOL(__memmove); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 5f2f94f6db17..5fa88c5a601c 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o lib-y += memset.o -lib-y += memmove.o lib-y += strcmp.o lib-y += strlen.o lib-y += string.o diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S deleted file mode 100644 index cb3e2e7ef0ba..000000000000 --- a/arch/riscv/lib/memmove.S +++ /dev/null @@ -1,317 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com> - */ - -#include <linux/linkage.h> -#include <asm/asm.h> - -SYM_FUNC_START(__memmove) - /* - * Returns - * a0 - dest - * - * Parameters - * a0 - Inclusive first byte of dest - * a1 - Inclusive first byte of src - * a2 - Length of copy n - * - * Because the return matches the parameter register a0, - * we will not clobber or modify that register. - * - * Note: This currently only works on little-endian. - * To port to big-endian, reverse the direction of shifts - * in the 2 misaligned fixup copy loops. 
- */ - - /* Return if nothing to do */ - beq a0, a1, .Lreturn_from_memmove - beqz a2, .Lreturn_from_memmove - - /* - * Register Uses - * Forward Copy: a1 - Index counter of src - * Reverse Copy: a4 - Index counter of src - * Forward Copy: t3 - Index counter of dest - * Reverse Copy: t4 - Index counter of dest - * Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest - * Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest - * Both Copy Modes: t0 - Link / Temporary for load-store - * Both Copy Modes: t1 - Temporary for load-store - * Both Copy Modes: t2 - Temporary for load-store - * Both Copy Modes: a5 - dest to src alignment offset - * Both Copy Modes: a6 - Shift ammount - * Both Copy Modes: a7 - Inverse Shift ammount - * Both Copy Modes: a2 - Alternate breakpoint for unrolled loops - */ - - /* - * Solve for some register values now. - * Byte copy does not need t5 or t6. - */ - mv t3, a0 - add t4, a0, a2 - add a4, a1, a2 - - /* - * Byte copy if copying less than (2 * SZREG) bytes. This can - * cause problems with the bulk copy implementation and is - * small enough not to bother. - */ - andi t0, a2, -(2 * SZREG) - beqz t0, .Lbyte_copy - - /* - * Now solve for t5 and t6. - */ - andi t5, t3, -SZREG - andi t6, t4, -SZREG - /* - * If dest(Register t3) rounded down to the nearest naturally - * aligned SZREG address, does not equal dest, then add SZREG - * to find the low-bound of SZREG alignment in the dest memory - * region. Note that this could overshoot the dest memory - * region if n is less than SZREG. This is one reason why - * we always byte copy if n is less than SZREG. - * Otherwise, dest is already naturally aligned to SZREG. - */ - beq t5, t3, 1f - addi t5, t5, SZREG - 1: - - /* - * If the dest and src are co-aligned to SZREG, then there is - * no need for the full rigmarole of a full misaligned fixup copy. - * Instead, do a simpler co-aligned copy. 
- */ - xor t0, a0, a1 - andi t1, t0, (SZREG - 1) - beqz t1, .Lcoaligned_copy - /* Fall through to misaligned fixup copy */ - -.Lmisaligned_fixup_copy: - bltu a1, a0, .Lmisaligned_fixup_copy_reverse - -.Lmisaligned_fixup_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */ - slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */ - sub a5, a1, t3 /* Find the difference between src and dest */ - andi a1, a1, -SZREG /* Align the src pointer */ - addi a2, t6, SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Forward - * load_val0 = load_ptr[0]; - * do { - * load_val1 = load_ptr[1]; - * store_ptr += 2; - * store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val0 = load_ptr[2]; - * load_ptr += 2; - * store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t0, (0 * SZREG)(a1) - 1: - REG_L t1, (1 * SZREG)(a1) - addi t3, t3, (2 * SZREG) - srl t0, t0, a6 - sll t2, t1, a7 - or t2, t0, t2 - REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3) - - beq t3, a2, 2f - - REG_L t0, (2 * SZREG)(a1) - addi a1, a1, (2 * SZREG) - srl t1, t1, a6 - sll t2, t0, a7 - or t2, t1, t2 - REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3) - - bne t3, t6, 1b - 2: - mv t3, t6 /* Fix the dest pointer in case the loop was broken */ - - add a1, t3, a5 /* Restore the src pointer */ - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lmisaligned_fixup_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */ - slli a6, a5, 3 /* Multiply 
by 8 to convert that to bits to shift */ - sub a5, a4, t4 /* Find the difference between src and dest */ - andi a4, a4, -SZREG /* Align the src pointer */ - addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Reverse - * load_val1 = load_ptr[0]; - * do { - * load_val0 = load_ptr[-1]; - * store_ptr -= 2; - * store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val1 = load_ptr[-2]; - * load_ptr -= 2; - * store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t1, ( 0 * SZREG)(a4) - 1: - REG_L t0, (-1 * SZREG)(a4) - addi t4, t4, (-2 * SZREG) - sll t1, t1, a7 - srl t2, t0, a6 - or t2, t1, t2 - REG_S t2, ( 1 * SZREG)(t4) - - beq t4, a2, 2f - - REG_L t1, (-2 * SZREG)(a4) - addi a4, a4, (-2 * SZREG) - sll t0, t0, a7 - srl t2, t1, a6 - or t2, t0, t2 - REG_S t2, ( 0 * SZREG)(t4) - - bne t4, t5, 1b - 2: - mv t4, t5 /* Fix the dest pointer in case the loop was broken */ - - add a4, t4, a5 /* Restore the src pointer */ - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * Simple copy loops for SZREG co-aligned memory locations. - * These also make calls to do byte copies for any unaligned - * data at their terminations. 
- */ -.Lcoaligned_copy: - bltu a1, a0, .Lcoaligned_copy_reverse - -.Lcoaligned_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - 1: - REG_L t1, ( 0 * SZREG)(a1) - addi a1, a1, SZREG - addi t3, t3, SZREG - REG_S t1, (-1 * SZREG)(t3) - bne t3, t6, 1b - - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lcoaligned_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - 1: - REG_L t1, (-1 * SZREG)(a4) - addi a4, a4, -SZREG - addi t4, t4, -SZREG - REG_S t1, ( 0 * SZREG)(t4) - bne t4, t5, 1b - - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * These are basically sub-functions within the function. They - * are used to byte copy until the dest pointer is in alignment. - * At which point, a bulk copy method can be used by the - * calling code. These work on the same registers as the bulk - * copy loops. Therefore, the register values can be picked - * up from where they were left and we avoid code duplication - * without any overhead except the call in and return jumps. - */ -.Lbyte_copy_until_aligned_forward: - beq t3, t5, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t5, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -.Lbyte_copy_until_aligned_reverse: - beq t4, t6, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t6, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -/* - * Simple byte copy loops. - * These will byte copy until they reach the end of data to copy. - * At that point, they will call to return from memmove. 
- */ -.Lbyte_copy: - bltu a1, a0, .Lbyte_copy_reverse - -.Lbyte_copy_forward: - beq t3, t4, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t4, 1b - 2: - ret - -.Lbyte_copy_reverse: - beq t4, t3, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t3, 1b - 2: - -.Lreturn_from_memmove: - ret - -SYM_FUNC_END(__memmove) -SYM_FUNC_ALIAS_WEAK(memmove, __memmove) -SYM_FUNC_ALIAS(__pi_memmove, __memmove) -SYM_FUNC_ALIAS(__pi___memmove, __memmove) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index 5f9c83ec548d..20677c8067da 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy) EXPORT_SYMBOL(memcpy); void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); + +/* + * Simply check if the buffers overlap and call memcpy() in that case, + * otherwise do a simple one-byte-at-a-time backward copy. + */ +void *__memmove(void *dest, const void *src, size_t count) +{ + if (dest < src || src + count <= dest) + return __memcpy(dest, src, count); + + if (dest > src) { + const char *s = src + count; + char *tmp = dest + count; + + while (count--) + *--tmp = *--s; + } + return dest; +} +EXPORT_SYMBOL(__memmove); + +void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove); +EXPORT_SYMBOL(memmove); +void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove); +void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove); -- 2.43.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 64+ messages in thread
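A userspace restatement of the dispatch logic added above can make the three cases explicit. The helper name below is invented for illustration; only the condition itself is taken from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical restatement of the patch's overlap check. It mirrors
 * `if (dest < src || src + count <= dest) return __memcpy(...)`:
 * only a dest lying inside [src, src + count) falls through to the
 * byte-at-a-time backward loop. */
static bool needs_backward_copy(const char *dest, const char *src, size_t count)
{
	if (dest < src || src + count <= dest)
		return false;	/* disjoint, or dest below src: memcpy() is safe */
	return dest > src;	/* dest == src needs no copying at all */
}
```

So within a 100-byte buffer, a copy from `buf` to `buf + 10` takes the backward loop, while a copy to `buf + 60` with `count = 50` does not, because the regions are disjoint.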
* RE: [PATCH 2/3] riscv: optimized memmove 2024-01-28 11:10 ` Jisheng Zhang @ 2024-01-28 12:47 ` David Laight -1 siblings, 0 replies; 64+ messages in thread From: David Laight @ 2024-01-28 12:47 UTC (permalink / raw) To: 'Jisheng Zhang', Paul Walmsley, Palmer Dabbelt, Albert Ou Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot From: Jisheng Zhang > Sent: 28 January 2024 11:10 > > When the destination buffer is before the source one, or when the > buffers doesn't overlap, it's safe to use memcpy() instead, which is > optimized to use a bigger data size possible. > ... > + * Simply check if the buffer overlaps an call memcpy() in case, > + * otherwise do a simple one byte at time backward copy. I'd at least do a 64bit copy loop if the addresses are aligned. Thinks a bit more.... Put the copy 64 bytes code (the body of the memcpy() loop) into an inline function and call it with increasing addresses in memcpy() and decrementing addresses in memmove(). So memcpy() contains: src_lim = src + count; ... alignment copy for (; src + 64 <= src_lim; src += 64, dest += 64) copy_64_bytes(dest, src); ... tail copy Then you can do something very similar for backwards copies. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 64+ messages in thread
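David's sketch can be filled out into compilable form. Everything below is a hedged illustration — the function names are invented, and where the kernel's inner loop would use eight register-wide loads followed by eight stores, a temporary array stands in for the register bank:

```c
#include <stddef.h>

/* Stand-in for the body of the kernel memcpy loop: load the whole
 * 64-byte block before storing it, as a bank of registers would. */
static inline void copy_64_bytes(unsigned char *dest, const unsigned char *src)
{
	unsigned char tmp[64];
	int i;

	for (i = 0; i < 64; i++)
		tmp[i] = src[i];
	for (i = 0; i < 64; i++)
		dest[i] = tmp[i];
}

/* Forward copy: bulk loop with increasing addresses, then a byte tail. */
static void *my_memcpy(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	const unsigned char *s_lim = s + count;

	for (; s + 64 <= s_lim; s += 64, d += 64)
		copy_64_bytes(d, s);
	while (s < s_lim)
		*d++ = *s++;
	return dst;
}

/* Backward copy for dest-above-src overlap: the same block body
 * driven with decreasing addresses, as suggested above. */
static void *my_memmove_bw(void *dst, const void *src, size_t count)
{
	unsigned char *d = (unsigned char *)dst + count;
	const unsigned char *s = (const unsigned char *)src + count;
	size_t remaining = count;

	for (; remaining >= 64; remaining -= 64) {
		d -= 64;
		s -= 64;
		copy_64_bytes(d, s);
	}
	while (remaining--)
		*--d = *--s;
	return dst;
}
```

Loading the whole block before storing it is what makes the same body reusable in both directions: within one block any overlap is harmless, and the block ordering (ascending vs. descending) handles overlap across blocks.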
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-28 12:47 ` David Laight @ 2024-01-30 11:30 ` Jisheng Zhang -1 siblings, 0 replies; 64+ messages in thread From: Jisheng Zhang @ 2024-01-30 11:30 UTC (permalink / raw) To: David Laight Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote: > From: Jisheng Zhang > > Sent: 28 January 2024 11:10 > > > > When the destination buffer is before the source one, or when the > > buffers doesn't overlap, it's safe to use memcpy() instead, which is > > optimized to use a bigger data size possible. > > > ... > > + * Simply check if the buffer overlaps an call memcpy() in case, > > + * otherwise do a simple one byte at time backward copy. > > I'd at least do a 64bit copy loop if the addresses are aligned. > > Thinks a bit more.... > > Put the copy 64 bytes code (the body of the memcpy() loop) > into it an inline function and call it with increasing addresses > in memcpy() are decrementing addresses in memmove. Hi David, Besides the 64 bytes copy, there's another optimization in __memcpy: word-by-word copy even if s and d are not aligned. So if we make the two optimized copies as inline functions and call them in memmove(), we almost duplicate the __memcpy code, so I think directly calling __memcpy is a bit better. Thanks > > So memcpy() contains: > src_lim = src_lim + count; > ... alignment copy > for (; src + 64 <= src_lim; src += 64; dest += 64) > copy_64_bytes(dest, src); > ... tail copy > > Then you can do something very similar for backwards copies. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) > ^ permalink raw reply [flat|nested] 64+ messages in thread
* RE: [PATCH 2/3] riscv: optimized memmove 2024-01-30 11:30 ` Jisheng Zhang @ 2024-01-30 11:51 ` David Laight -1 siblings, 0 replies; 64+ messages in thread From: David Laight @ 2024-01-30 11:51 UTC (permalink / raw) To: 'Jisheng Zhang' Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot From: Jisheng Zhang > Sent: 30 January 2024 11:31 > > On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote: > > From: Jisheng Zhang > > > Sent: 28 January 2024 11:10 > > > > > > When the destination buffer is before the source one, or when the > > > buffers doesn't overlap, it's safe to use memcpy() instead, which is > > > optimized to use a bigger data size possible. > > > > > ... > > > + * Simply check if the buffer overlaps an call memcpy() in case, > > > + * otherwise do a simple one byte at time backward copy. > > > > I'd at least do a 64bit copy loop if the addresses are aligned. > > > > Thinks a bit more.... > > > > Put the copy 64 bytes code (the body of the memcpy() loop) > > into it an inline function and call it with increasing addresses > > in memcpy() are decrementing addresses in memmove. > > Hi David, > > Besides the 64 bytes copy, there's another optimization in __memcpy: > word-by-word copy even if s and d are not aligned. > So if we make the two optimizd copy as inline functions and call them > in memmove(), we almost duplicate the __memcpy code, so I think > directly calling __memcpy is a bit better. If a forwards copy is valid call memcpy() - which I think you do. If not you can still use the same 'copy 8 register' code that memcpy() uses - just with a decrementing block address. 
David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-28 11:10 ` Jisheng Zhang @ 2024-01-30 11:39 ` Nick Kossifidis -1 siblings, 0 replies; 64+ messages in thread From: Nick Kossifidis @ 2024-01-30 11:39 UTC (permalink / raw) To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot On 1/28/24 13:10, Jisheng Zhang wrote: > From: Matteo Croce <mcroce@microsoft.com> > > When the destination buffer is before the source one, or when the > buffers doesn't overlap, it's safe to use memcpy() instead, which is > optimized to use a bigger data size possible. > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > Reported-by: kernel test robot <lkp@intel.com> > Signed-off-by: Jisheng Zhang <jszhang@kernel.org> I'd expect to have memmove handle both fw/bw copying and then memcpy being an alias to memmove, to also take care when regions overlap and avoid undefined behavior. > --- a/arch/riscv/lib/string.c > +++ b/arch/riscv/lib/string.c > @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy) > EXPORT_SYMBOL(memcpy); > void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); > void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); > + > +/* > + * Simply check if the buffer overlaps an call memcpy() in case, > + * otherwise do a simple one byte at time backward copy. > + */ > +void *__memmove(void *dest, const void *src, size_t count) > +{ > + if (dest < src || src + count <= dest) > + return __memcpy(dest, src, count); > + > + if (dest > src) { > + const char *s = src + count; > + char *tmp = dest + count; > + > + while (count--) > + *--tmp = *--s; > + } > + return dest; > +} > +EXPORT_SYMBOL(__memmove); > + Here is an approach for the backwards case to get things started... 
static void copy_bw(void *dst_ptr, const void *src_ptr, size_t len) { union const_data src = { .as_bytes = src_ptr + len }; union data dst = { .as_bytes = dst_ptr + len }; size_t remaining = len; size_t src_offt = 0; if (len < 2 * WORD_SIZE) goto trailing_bw; for(; dst.as_uptr & WORD_MASK; remaining--) *--dst.as_bytes = *--src.as_bytes; src_offt = src.as_uptr & WORD_MASK; if (!src_offt) { for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) *--dst.as_ulong = *--src.as_ulong; } else { unsigned long cur, prev; src.as_bytes -= src_offt; for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) { cur = *src.as_ulong; prev = *--src.as_ulong; *--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) | prev >> (src_offt * 8); } src.as_bytes += src_offt; } trailing_bw: while (remaining-- > 0) *--dst.as_bytes = *--src.as_bytes; } Regards, Nick _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 64+ messages in thread
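Nick's sketch relies on `union data` / `union const_data` helpers that are not shown in the mail. Filled out with assumed definitions (my guess at their shape, modeled on the patch series' style), it builds and runs in userspace. Two caveats carried over from the kernel-style original: the word-combining path assumes little-endian byte order, and the word loops may read a few bytes just outside the logical source range, so callers need some margin around the source buffer:

```c
#include <stddef.h>
#include <stdint.h>

#define WORD_SIZE sizeof(unsigned long)
#define WORD_MASK (WORD_SIZE - 1)

/* Assumed punning helpers; these definitions are not from the mail. */
union data {
	unsigned char *as_bytes;
	unsigned long *as_ulong;
	uintptr_t as_uptr;
};

union const_data {
	const unsigned char *as_bytes;
	const unsigned long *as_ulong;
	uintptr_t as_uptr;
};

/* Nick's backward copy, unchanged except for explicit casts so the
 * pointer arithmetic is standard C rather than a GCC void* extension. */
static void copy_bw(void *dst_ptr, const void *src_ptr, size_t len)
{
	union const_data src = { .as_bytes = (const unsigned char *)src_ptr + len };
	union data dst = { .as_bytes = (unsigned char *)dst_ptr + len };
	size_t remaining = len;
	size_t src_offt = 0;

	if (len < 2 * WORD_SIZE)
		goto trailing_bw;

	/* Byte-copy backward until the destination is word-aligned. */
	for (; dst.as_uptr & WORD_MASK; remaining--)
		*--dst.as_bytes = *--src.as_bytes;

	src_offt = src.as_uptr & WORD_MASK;
	if (!src_offt) {
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE)
			*--dst.as_ulong = *--src.as_ulong;
	} else {
		unsigned long cur, prev;

		/* Misaligned source: combine two aligned loads per store. */
		src.as_bytes -= src_offt;
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
			cur = *src.as_ulong;
			prev = *--src.as_ulong;
			*--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) |
					  prev >> (src_offt * 8);
		}
		src.as_bytes += src_offt;
	}

trailing_bw:
	while (remaining-- > 0)
		*--dst.as_bytes = *--src.as_bytes;
}
```

Because it writes from the top down, it stays correct when the destination overlaps the source from above — the case the forward-copying memcpy() cannot handle.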
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-30 11:39 ` Nick Kossifidis @ 2024-01-30 13:12 ` Jisheng Zhang -1 siblings, 0 replies; 64+ messages in thread From: Jisheng Zhang @ 2024-01-30 13:12 UTC (permalink / raw) To: Nick Kossifidis Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote: > On 1/28/24 13:10, Jisheng Zhang wrote: > > From: Matteo Croce <mcroce@microsoft.com> > > > > When the destination buffer is before the source one, or when the > > buffers doesn't overlap, it's safe to use memcpy() instead, which is > > optimized to use a bigger data size possible. > > > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > > Reported-by: kernel test robot <lkp@intel.com> > > Signed-off-by: Jisheng Zhang <jszhang@kernel.org> > > I'd expect to have memmove handle both fw/bw copying and then memcpy being > an alias to memmove, to also take care when regions overlap and avoid > undefined behavior. Hi Nick, Here is something from man memcpy: "void *memcpy(void dest[restrict .n], const void src[restrict .n], size_t n); The memcpy() function copies n bytes from memory area src to memory area dest. The memory areas must not overlap. Use memmove(3) if the memory areas do overlap." IMHO, the "restrict" implies that there's no overlap. If overlap happens, the manual doesn't say what will happen. From another side, I have a concern: currently, other arch don't have this alias behavior, IIUC (at least, per my understanding of arm and arm64 memcpy implementations) they just copy forward. I want to keep similar behavior for riscv. So I want to hear more before going towards alias-memcpy-to-memmove direction.
Thanks > > > > --- a/arch/riscv/lib/string.c > > +++ b/arch/riscv/lib/string.c > > @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy) > > EXPORT_SYMBOL(memcpy); > > void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); > > void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); > > + > > +/* > > + * Simply check if the buffer overlaps an call memcpy() in case, > > + * otherwise do a simple one byte at time backward copy. > > + */ > > +void *__memmove(void *dest, const void *src, size_t count) > > +{ > > + if (dest < src || src + count <= dest) > > + return __memcpy(dest, src, count); > > + > > + if (dest > src) { > > + const char *s = src + count; > > + char *tmp = dest + count; > > + > > + while (count--) > > + *--tmp = *--s; > > + } > > + return dest; > > +} > > +EXPORT_SYMBOL(__memmove); > > + > > Here is an approach for the backwards case to get things started... > > static void > copy_bw(void *dst_ptr, const void *src_ptr, size_t len) > { > union const_data src = { .as_bytes = src_ptr + len }; > union data dst = { .as_bytes = dst_ptr + len }; > size_t remaining = len; > size_t src_offt = 0; > > if (len < 2 * WORD_SIZE) > goto trailing_bw; > > for(; dst.as_uptr & WORD_MASK; remaining--) > *--dst.as_bytes = *--src.as_bytes; > > src_offt = src.as_uptr & WORD_MASK; > if (!src_offt) { > for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) > *--dst.as_ulong = *--src.as_ulong; > } else { > unsigned long cur, prev; > src.as_bytes -= src_offt; > for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) { > cur = *src.as_ulong; > prev = *--src.as_ulong; > *--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) | > prev >> (src_offt * 8); > } > src.as_bytes += src_offt; > } > > trailing_bw: > while (remaining-- > 0) > *--dst.as_bytes = *--src.as_bytes; > } > > Regards, > Nick ^ permalink raw reply [flat|nested] 64+ messages in thread
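The behavioral question being debated is easy to demonstrate in userspace: a forward-only byte copy — standing in for a memcpy() that ignores overlap — smears data when the destination overlaps the source from above, while memmove() shifts it correctly:

```c
#include <string.h>

/* Forward-only byte copy, standing in for an overlap-oblivious memcpy(). */
static void *naive_fw_copy(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;	/* reads bytes it may already have overwritten */
	return dest;
}
```

Shifting "abcdef" right by one with `naive_fw_copy(a + 1, a, 5)` yields "aaaaaa" — the first byte propagates through the whole overlap — whereas `memmove(a + 1, a, 5)` yields the intended "aabcde".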
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-30 13:12 ` Jisheng Zhang @ 2024-01-30 16:52 ` Nick Kossifidis -1 siblings, 0 replies; 64+ messages in thread From: Nick Kossifidis @ 2024-01-30 16:52 UTC (permalink / raw) To: Jisheng Zhang Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot On 1/30/24 15:12, Jisheng Zhang wrote: > On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote: >> On 1/28/24 13:10, Jisheng Zhang wrote: >>> From: Matteo Croce <mcroce@microsoft.com> >>> >>> When the destination buffer is before the source one, or when the >>> buffers doesn't overlap, it's safe to use memcpy() instead, which is >>> optimized to use a bigger data size possible. >>> >>> Signed-off-by: Matteo Croce <mcroce@microsoft.com> >>> Reported-by: kernel test robot <lkp@intel.com> >>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org> >> >> I'd expect to have memmove handle both fw/bw copying and then memcpy being >> an alias to memmove, to also take care when regions overlap and avoid >> undefined behavior. > > Hi Nick, > > Here is somthing from man memcpy: > > "void *memcpy(void dest[restrict .n], const void src[restrict .n], > size_t n); > > The memcpy() function copies n bytes from memory area src to memory area dest. > The memory areas must not overlap. Use memmove(3) if the memory areas do over‐ > lap." > > IMHO, the "restrict" implies that there's no overlap. If overlap > happens, the manual doesn't say what will happen. > > From another side, I have a concern: currently, other arch don't have > this alias behavior, IIUC(at least, per my understanding of arm and arm64 > memcpy implementations)they just copy forward. I want to keep similar behavior > for riscv. > > So I want to hear more before going towards alias-memcpy-to-memmove direction. > > Thanks If you read Matteo's original post that was also his suggestion, and Linus has also commented on that. 
In general it's better to handle the case where the regions provided to memcpy() overlap than to resort to "undefined behavior". I provided a backwards copy example that you can use so that we can have both fw and bw copying for memmove(), and use memmove() in any case. The [restrict .n] in the prototype is just there to say that the size of src is restricted by n (the next argument). If someone uses memcpy() with overlapping regions, which is always a possibility, in your case it'll result in corrupted data, we won't even get a warning (still counts as undefined behavior) about it. Regards, Nick ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-30 16:52 ` Nick Kossifidis @ 2024-01-31 5:25 ` Jisheng Zhang -1 siblings, 0 replies; 64+ messages in thread From: Jisheng Zhang @ 2024-01-31 5:25 UTC (permalink / raw) To: Nick Kossifidis Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot On Tue, Jan 30, 2024 at 06:52:24PM +0200, Nick Kossifidis wrote: > On 1/30/24 15:12, Jisheng Zhang wrote: > > On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote: > > > On 1/28/24 13:10, Jisheng Zhang wrote: > > > > From: Matteo Croce <mcroce@microsoft.com> > > > > > > > > When the destination buffer is before the source one, or when the > > > > buffers doesn't overlap, it's safe to use memcpy() instead, which is > > > > optimized to use a bigger data size possible. > > > > > > > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > > > > Reported-by: kernel test robot <lkp@intel.com> > > > > Signed-off-by: Jisheng Zhang <jszhang@kernel.org> > > > > > > I'd expect to have memmove handle both fw/bw copying and then memcpy being > > > an alias to memmove, to also take care when regions overlap and avoid > > > undefined behavior. > > > > Hi Nick, > > > > Here is somthing from man memcpy: > > > > "void *memcpy(void dest[restrict .n], const void src[restrict .n], > > size_t n); > > > > The memcpy() function copies n bytes from memory area src to memory area dest. > > The memory areas must not overlap. Use memmove(3) if the memory areas do over‐ > > lap." > > > > IMHO, the "restrict" implies that there's no overlap. If overlap > > happens, the manual doesn't say what will happen. > > > > From another side, I have a concern: currently, other arch don't have > > this alias behavior, IIUC(at least, per my understanding of arm and arm64 > > memcpy implementations)they just copy forward. I want to keep similar behavior > > for riscv. > > > > So I want to hear more before going towards alias-memcpy-to-memmove direction. 
> > > > Thanks > Hi Nick, > If you read Matteo's original post that was also his suggestion, and Linus I did read all the discussions in Matteo's v1 ~ v5 before this renewed submission. Per my understanding, Matteo was also concerned that no other arch's implementation has such memcpy-alias-memmove behavior. > has also commented on that. In general it's better to handle the case where Linus's comment in https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132 was about glibc aliasing memcpy to memmove, rather than about this patch series. > the regions provided to memcpy() overlap than to resort to "undefined > behavior", I provided a backwards copy example that you can use so that we > can have both fw and bw copying for memmove(), and use memmove() in any > case. The [restrict .n] in the prototype is just there to say that the size > of src is restricted by n (the next argument). If someone uses memcpy() with I don't have the C99 spec at hand, but I found GCC's explanation of the restrict keyword in [1]: "the restrict declaration promises that the code will not access that object in any other way--only through p." So if there's overlap in memcpy(), that contradicts the restrict implication. [1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html And per the manual, memcpy() users must ensure that "The memory areas must not overlap." So I think all of the Linux kernel's memcpy() implementations (which only copy forward and don't take overlap into consideration) are correct. I did see the alias-memcpy-as-memmove approach in some libc implementations, but this is not the style of the current kernel implementations.
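To make the concern about overlap concrete, here is a small standalone demo (not kernel code) of how a forward-only copy, the pattern the kernel's memcpy() implementations use, silently corrupts data once dest overlaps src from above:

```c
#include <stddef.h>

/* A naive forward-only byte copy; the real kernel implementations
 * also do word-sized accesses, but the overlap hazard is the same. */
static void fw_copy(unsigned char *d, const unsigned char *s, size_t n)
{
	/* each destination byte may overwrite a source byte that has
	 * not been read yet when d overlaps s from above */
	while (n--)
		*d++ = *s++;
}
```

For example, with buf = "abcdef", fw_copy(buf + 2, buf, 4) produces "ababab", whereas memmove(buf + 2, buf, 4) would produce "ababcd": by the time the copy reaches the third byte, the source bytes have already been overwritten.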
Given that the current riscv asm implementation also doesn't do the alias and only copies forward, and this series improves performance without changing that behavior, is it better to divide this into two steps? Firstly, merge this series if there's no obvious bug; secondly, do the alias as you suggested. Since you have a basic implementation, you could even submit your patch ;) What do you think about this two-step solution? Thanks > overlapping regions, which is always a possibility, in your case it'll > result corrupted data, we won't even get a warning (still counts as > undefined behavior) about it. > > Regards, > Nick > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 2/3] riscv: optimized memmove 2024-01-31 5:25 ` Jisheng Zhang @ 2024-01-31 9:13 ` Nick Kossifidis -1 siblings, 0 replies; 64+ messages in thread From: Nick Kossifidis @ 2024-01-31 9:13 UTC (permalink / raw) To: Jisheng Zhang Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel, Matteo Croce, kernel test robot On 1/31/24 07:25, Jisheng Zhang wrote: > > I didn't have c99 spec in hand, but I found gcc explanations about > restrict keyword from [1]: > > "the restrict declaration promises that the code will not access that > object in any other way--only through p." > > So if there's overlap in memcpy, then it contradicts the restrict > implication. > > [1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html > The union used in the code also contradicts this. BTW the restrict qualifier isn't used in kernel's lib/string.c nor in the current implementation (https://elixir.bootlin.com/linux/latest/source/arch/riscv/include/asm/string.h#L16). > And from the manual, if the memcpy users must ensure "The memory areas > must not overlap." So I think all linux kernel's memcpy implementations(only copy > fw and don't take overlap into consideration) are right. > > I did see the alias-memcpy-as-memmove in some libc implementations, but > this is not the style in current kernel's implementations. > > Given current riscv asm implementation also doesn't do the alias and > copy-fw only, and this series improves performance and doesn't introduce the > Is it better to divide this into two steps: Firstly, merge this series > if there's no obvious bug; secondly, do the alias as you suggested, > since you have a basic implementation, you could even submit your patch > ;) What do you think about this two steps solution? > I still don't understand why you prefer undefined behavior over just aliasing memcpy to memmove. Anyway, do as you wish, I don't have time to work on this unfortunately. 
Feel free to use the code I shared for bw copy etc. Regards, Nick ^ permalink raw reply [flat|nested] 64+ messages in thread
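For readers without the patch in front of them, the union-based access style Nick refers to looks roughly like the following sketch (member and function names are illustrative, not the patch's own). The same pointer is viewed both as bytes and as machine words, which is what sits uneasily next to restrict semantics:

```c
#include <stddef.h>
#include <stdint.h>

/* One pointer, three views: bytes, machine words, and a raw integer
 * for alignment checks. */
union ptr_view {
	unsigned char *as_u8;
	unsigned long *as_ulong;
	uintptr_t as_uptr;
};

static void *word_memcpy(void *dest, const void *src, size_t n)
{
	union ptr_view d = { .as_u8 = dest };
	union ptr_view s = { .as_u8 = (unsigned char *)src };

	/* bulk-copy whole words while both pointers are word aligned */
	if (!((d.as_uptr | s.as_uptr) & (sizeof(long) - 1))) {
		while (n >= sizeof(long)) {
			*d.as_ulong++ = *s.as_ulong++;
			n -= sizeof(long);
		}
	}

	/* misaligned buffers and the tail go byte by byte */
	while (n--)
		*d.as_u8++ = *s.as_u8++;

	return dest;
}
```

The point of contention is visible here: the word-sized view accesses the same object through a second pointer type, so a prototype promising restrict-style exclusivity would be stricter than what the code actually does.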
end of thread, other threads:[~2024-01-31 11:31 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --

2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
2021-06-15  8:57 ` David Laight
2021-06-15 13:08 ` Bin Meng
2021-06-15 13:18 ` David Laight
2021-06-15 13:28 ` Bin Meng
2021-06-15 16:12 ` Emil Renner Berthing
2021-06-16  0:33 ` Bin Meng
2021-06-16  2:01 ` Matteo Croce
2021-06-16  8:24 ` David Laight
2021-06-16 10:48 ` Akira Tsukamoto
2021-06-16 19:06 ` Matteo Croce
2021-06-15 13:44 ` Matteo Croce
2021-06-16 11:46 ` Guo Ren
2021-06-16 18:52 ` Matteo Croce
2021-06-17 21:30 ` David Laight
2021-06-17 21:48 ` Matteo Croce
2021-06-18  0:32 ` Matteo Croce
2021-06-18  1:05 ` Matteo Croce
2021-06-18  8:32 ` David Laight
2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
2024-01-28 12:47 ` David Laight
2024-01-30 11:30 ` Jisheng Zhang
2024-01-30 11:51 ` David Laight
2024-01-30 11:39 ` Nick Kossifidis
2024-01-30 13:12 ` Jisheng Zhang
2024-01-30 16:52 ` Nick Kossifidis
2024-01-31  5:25 ` Jisheng Zhang
2024-01-31  9:13 ` Nick Kossifidis