linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] riscv: optimized mem* functions
@ 2021-06-15  2:38 Matteo Croce
  2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Replace the assembly mem{cpy,move,set} with C equivalents.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.
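
The core of the approach, in a minimal sketch (much simplified from the
actual patches; it leaves out the shifted copy used when source and
destination have a different misalignment):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: byte copy until dest is aligned, then a long at a time
 * while the source happens to be co-aligned, then the tail bytes.
 */
static void *sketch_memcpy(void *dest, const void *src, size_t count)
{
	const size_t wsize = sizeof(unsigned long);
	const uint8_t *s = src;
	uint8_t *d = dest;

	for (; count && ((uintptr_t)d & (wsize - 1)); count--)
		*d++ = *s++;

	if (!((uintptr_t)s & (wsize - 1)))
		for (; count >= wsize; count -= wsize, d += wsize, s += wsize)
			*(unsigned long *)d = *(const unsigned long *)s;

	while (count--)
		*d++ = *s++;

	return dest;
}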

Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  18 ++--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  17 ----
 arch/riscv/lib/Makefile         |   4 +-
 arch/riscv/lib/memcpy.S         | 108 ---------------------
 arch/riscv/lib/memmove.S        |  64 -------------
 arch/riscv/lib/memset.S         | 113 ----------------------
 arch/riscv/lib/string.c         | 162 ++++++++++++++++++++++++++++++++
 8 files changed, 172 insertions(+), 315 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.31.1



* [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
@ 2021-06-15  2:38 ` Matteo Croce
  2021-06-15  8:57   ` David Laight
  2021-06-16 11:46   ` Guo Ren
  2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose each value (e.g. on 64-bit little-endian, with the source
2 bytes past a long boundary, every store is cur >> 16 | next << 48).
Finally, copy the remainder one byte at a time.

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

The decreased CPU time spent in memcpy() is also observable with perf top.
This is the `perf top -Ue task-clock` output when doing the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |   8 ++-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   2 +-
 arch/riscv/lib/memcpy.S         | 108 --------------------------------
 arch/riscv/lib/string.c         |  94 +++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 113 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
 
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..525f9ee25a74
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* below this size, a classic byte at a time copy is done */
+#define MIN_THRESHOLD 64
+
+/* convenience types to avoid cast between different pointer types */
+union types {
+	u8 *u8;
+	unsigned long *ulong;
+	uintptr_t uptr;
+};
+
+union const_types {
+	const u8 *u8;
+	unsigned long *ulong;
+};
+
+void *memcpy(void *dest, const void *src, size_t count)
+{
+	const int bytes_long = BITS_PER_LONG / 8;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	const int mask = bytes_long - 1;
+	const int distance = (src - dest) & mask;
+#endif
+	union const_types s = { .u8 = src };
+	union types d = { .u8 = dest };
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (count <= MIN_THRESHOLD)
+		goto copy_remainder;
+
+	/* copy a byte at time until destination is aligned */
+	for (; count && d.uptr & mask; count--)
+		*d.u8++ = *s.u8++;
+
+	if (distance) {
+		unsigned long last, next;
+
+		/* move s backward to the previous alignment boundary */
+		s.u8 -= distance;
+
+		/* 32/64 bit wide copy from s to d.
+		 * d is aligned now but s is not, so read s alignment wise,
+		 * and do proper shift to get the right value.
+		 * Works only on Little Endian machines.
+		 */
+		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+			last = next;
+			next = s.ulong[1];
+
+			d.ulong[0] = last >> (distance * 8) |
+				     next << ((bytes_long - distance) * 8);
+
+			d.ulong++;
+			s.ulong++;
+		}
+
+		/* restore s with the original offset */
+		s.u8 += distance;
+	} else
+#endif
+	{
+		/* if the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= bytes_long; count -= bytes_long)
+			*d.ulong++ = *s.ulong++;
+	}
+
+	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
+	goto copy_remainder;
+
+copy_remainder:
+	while (count--)
+		*d.u8++ = *s.u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	return memcpy(dest, src, count);
+}
+EXPORT_SYMBOL(__memcpy);
-- 
2.31.1



* [PATCH 2/3] riscv: optimized memmove
  2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
  2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
@ 2021-06-15  2:38 ` Matteo Croce
  2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
  2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
  3 siblings, 0 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is placed before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  6 ++--
 arch/riscv/kernel/riscv_ksyms.c |  2 --
 arch/riscv/lib/Makefile         |  1 -
 arch/riscv/lib/memmove.S        | 64 ---------------------------------
 arch/riscv/lib/string.c         | 26 ++++++++++++++
 5 files changed, 29 insertions(+), 70 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 6b5d6fc3eab4..25d9b9078569 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -17,11 +17,11 @@ extern asmlinkage void *__memset(void *, int, size_t);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
+#define __HAVE_ARCH_MEMMOVE
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 #endif
 
-#define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
 /* For those files which don't want to check by kasan. */
 #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 3f6d512a5b97..361565c4db7e 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,6 +10,4 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 2ffe85d4baee..484f5ff7b508 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index 07d1d2152ba5..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 525f9ee25a74..bc006708f075 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -92,3 +92,29 @@ void *__memcpy(void *dest, const void *src, size_t count)
 	return memcpy(dest, src, count);
 }
 EXPORT_SYMBOL(__memcpy);
+
+/*
+ * Check whether the buffers overlap: when a forward copy is safe, defer
+ * to memcpy(), otherwise do a simple one byte at a time backward copy.
+ */
+void *memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(memmove);
+
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	return memmove(dest, src, count);
+}
+EXPORT_SYMBOL(__memmove);
-- 
2.31.1



* [PATCH 3/3] riscv: optimized memset
  2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
  2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
  2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
@ 2021-06-15  2:38 ` Matteo Croce
  2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
  3 siblings, 0 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte at a time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed,
and finally fills the remaining data one byte at a time.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  10 +--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  13 ----
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memset.S         | 113 --------------------------------
 arch/riscv/lib/string.c         |  42 ++++++++++++
 6 files changed, 45 insertions(+), 135 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memset.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 25d9b9078569..90500635035a 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -6,14 +6,10 @@
 #ifndef _ASM_RISCV_STRING_H
 #define _ASM_RISCV_STRING_H
 
-#include <linux/types.h>
-#include <linux/linkage.h>
-
-#define __HAVE_ARCH_MEMSET
-extern asmlinkage void *memset(void *, int, size_t);
-extern asmlinkage void *__memset(void *, int, size_t);
-
 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t count);
+extern void *__memset(void *s, int c, size_t count);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index d3081e4d9600..e635ce1e5645 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -31,7 +31,6 @@ obj-y	+= syscall_table.o
 obj-y	+= sys_riscv.o
 obj-y	+= time.o
 obj-y	+= traps.o
-obj-y	+= riscv_ksyms.o
 obj-y	+= stacktrace.o
 obj-y	+= cacheinfo.o
 obj-y	+= patch.o
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
deleted file mode 100644
index 361565c4db7e..000000000000
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ /dev/null
@@ -1,13 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2017 Zihao Yu
- */
-
-#include <linux/export.h>
-#include <linux/uaccess.h>
-
-/*
- * Assembly functions that may be used (directly or indirectly) by modules
- */
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(__memset);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 484f5ff7b508..e33263cc622a 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memset.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
deleted file mode 100644
index 34c5360c6705..000000000000
--- a/arch/riscv/lib/memset.S
+++ /dev/null
@@ -1,113 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memset(void *, int, size_t) */
-ENTRY(__memset)
-WEAK(memset)
-	move t0, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
-
-	/*
-	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
-	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f  /* Skip if already aligned */
-	/* Handle initial misalignment */
-	sub a4, a3, t0
-1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2: /* Duff's device with 32 XLEN stores per iteration */
-	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
-#ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
-#endif
-
-	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
-
-	andi a4, a4, 31*SZREG  /* Calculate remainder */
-	beqz a4, 3f            /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG  /* Calculate initial offset */
-
-	/* Adjust start address with offset */
-	sub t0, t0, a4
-
-	/* Jump into loop body */
-	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
-#ifdef CONFIG_64BIT
-	srli a4, a4, 1
-#endif
-	add a5, a5, a4
-	jr a5
-3:
-	REG_S a1,        0(t0)
-	REG_S a1,    SZREG(t0)
-	REG_S a1,  2*SZREG(t0)
-	REG_S a1,  3*SZREG(t0)
-	REG_S a1,  4*SZREG(t0)
-	REG_S a1,  5*SZREG(t0)
-	REG_S a1,  6*SZREG(t0)
-	REG_S a1,  7*SZREG(t0)
-	REG_S a1,  8*SZREG(t0)
-	REG_S a1,  9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
-5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
-6:
-	ret
-END(__memset)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index bc006708f075..62869627e139 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -118,3 +118,45 @@ void *__memmove(void *dest, const void *src, size_t count)
 	return memmove(dest, src, count);
 }
 EXPORT_SYMBOL(__memmove);
+
+void *memset(void *s, int c, size_t count)
+{
+	union types dest = { .u8 = s };
+
+	if (count > MIN_THRESHOLD) {
+		const int bytes_long = BITS_PER_LONG / 8;
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+		cu =
+#if BITS_PER_LONG == 64
+			cu << 56 | cu << 48 | cu << 40 | cu << 32 |
+#endif
+			cu << 24 | cu << 16 | cu << 8 | cu;
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+		/* Fill the buffer one byte at time until the destination
+		 * is aligned on a 32/64 bit boundary.
+		 */
+		for (; count && dest.uptr % bytes_long; count--)
+			*dest.u8++ = c;
+#endif
+
+		/* Copy using the largest size allowed */
+		for (; count >= bytes_long; count -= bytes_long)
+			*dest.ulong++ = cu;
+	}
+
+	/* copy the remainder */
+	while (count--)
+		*dest.u8++ = c;
+
+	return s;
+}
+EXPORT_SYMBOL(memset);
+
+void *__memset(void *s, int c, size_t count)
+{
+	return memset(s, c, count);
+}
+EXPORT_SYMBOL(__memset);
-- 
2.31.1



* Re: [PATCH 0/3] riscv: optimized mem* functions
  2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
                   ` (2 preceding siblings ...)
  2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
@ 2021-06-15  2:43 ` Bin Meng
  3 siblings, 0 replies; 26+ messages in thread
From: Bin Meng @ 2021-06-15  2:43 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

Hi Matteo,

On Tue, Jun 15, 2021 at 10:39 AM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Replace the assembly mem{cpy,move,set} with C equivalent.
>
> Try to access RAM with the largest bit width possible, but without
> doing unaligned accesses.
>
> Tested on a BeagleV Starlight with a SiFive U74 core, where the
> improvement is noticeable.
>

There is already a patch on the ML for optimizing the assembly version.
https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/

Would you please try that and compare the results?

Regards,
Bin


* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
@ 2021-06-15  8:57   ` David Laight
  2021-06-15 13:08     ` Bin Meng
  2021-06-16 11:46   ` Guo Ren
  1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2021-06-15  8:57 UTC (permalink / raw)
  To: 'Matteo Croce', linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 15 June 2021 03:38
> 
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.

I'm surprised that the C loop:

> +		for (; count >= bytes_long; count -= bytes_long)
> +			*d.ulong++ = *s.ulong++;

ends up being faster than the ASM 'read lots' - 'write lots' loop.

Especially since there was an earlier patch to convert
copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
instead of a tight single register copy loop.

I'd also guess that the performance needs to be measured on
different classes of riscv cpu.

A simple cpu will behave differently to one that can execute
multiple instructions per clock.
Any form of 'out of order' execution also changes things.
The other big change is whether the cpu can do a memory
read and write in the same clock.

I'd guess that riscv exist with some/all of those features.

	David




* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  8:57   ` David Laight
@ 2021-06-15 13:08     ` Bin Meng
  2021-06-15 13:18       ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Bin Meng @ 2021-06-15 13:08 UTC (permalink / raw)
  To: David Laight
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 15 June 2021 03:38
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
>
> I'm surprised that the C loop:
>
> > +             for (; count >= bytes_long; count -= bytes_long)
> > +                     *d.ulong++ = *s.ulong++;
>
> ends up being faster than the ASM 'read lots' - 'write lots' loop.

I believe that's because the assembly version has some unaligned
access cases, which end up being trap-n-emulated in the OpenSBI
firmware, and that is a big overhead.

>
> Especially since there was an earlier patch to convert
> copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
> instead of a tight single register copy loop.
>
> I'd also guess that the performance needs to be measured on
> different classes of riscv cpu.
>
> A simple cpu will behave differently to one that can execute
> multiple instructions per clock.
> Any form of 'out of order' execution also changes things.
> The other big change is whether the cpu can do a memory
> read and write in the same clock.
>
> I'd guess that riscv exist with some/all of those features.

Regards,
Bin


* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:08     ` Bin Meng
@ 2021-06-15 13:18       ` David Laight
  2021-06-15 13:28         ` Bin Meng
  2021-06-15 13:44         ` Matteo Croce
  0 siblings, 2 replies; 26+ messages in thread
From: David Laight @ 2021-06-15 13:18 UTC (permalink / raw)
  To: 'Bin Meng'
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

From: Bin Meng
> Sent: 15 June 2021 14:09
> 
> On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> >
...
> > I'm surprised that the C loop:
> >
> > > +             for (; count >= bytes_long; count -= bytes_long)
> > > +                     *d.ulong++ = *s.ulong++;
> >
> > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> 
> I believe that's because the assembly version has some unaligned
> access cases, which end up being trap-n-emulated in the OpenSBI
> firmware, and that is a big overhead.

Ah, that would make sense since the asm user copy code
was broken for misaligned copies.
I suspect memcpy() was broken the same way.

I'm surprised IP_NET_ALIGN isn't set to 2 to try to
avoid all these misaligned copies in the network stack.
Although avoiding 8n+4 aligned data is rather harder.
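
(For context, a typical RX buffer allocation that honours that alignment
looks roughly like the sketch below; the driver variable names here are
only illustrative.)

	/* NET_IP_ALIGN (normally 2) bytes of headroom put the IP header
	 * on a 4-byte boundary after the 14-byte Ethernet header, so the
	 * stack's copies start out aligned.
	 */
	skb = netdev_alloc_skb(ndev, len + NET_IP_ALIGN);
	if (skb)
		skb_reserve(skb, NET_IP_ALIGN);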

Misaligned copies are just best avoided - really even on x86.
The 'real fun' is when the access crosses TLB boundaries.

	David



* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:18       ` David Laight
@ 2021-06-15 13:28         ` Bin Meng
  2021-06-15 16:12           ` Emil Renner Berthing
  2021-06-15 13:44         ` Matteo Croce
  1 sibling, 1 reply; 26+ messages in thread
From: Bin Meng @ 2021-06-15 13:28 UTC (permalink / raw)
  To: David Laight, Gary Guo
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 9:18 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > +             for (; count >= bytes_long; count -= bytes_long)
> > > > +                     *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>

Yes, Gary Guo sent one patch long time ago against the broken assembly
version, but that patch was still not applied as of today.
https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/

I suggest Matteo re-test using Gary's version.

> I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>
> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.

Regards,
Bin


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:18       ` David Laight
  2021-06-15 13:28         ` Bin Meng
@ 2021-06-15 13:44         ` Matteo Croce
  1 sibling, 0 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-15 13:44 UTC (permalink / raw)
  To: David Laight
  Cc: Bin Meng, linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 3:18 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > +             for (; count >= bytes_long; count -= bytes_long)
> > > > +                     *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>
> I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>

That's up to the network driver, indeed I have a patch already for the
BeagleV one:

https://lore.kernel.org/netdev/20210615012107.577ead86@linux.microsoft.com/T/

> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.
>

-- 
per aspera ad upstream


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:28         ` Bin Meng
@ 2021-06-15 16:12           ` Emil Renner Berthing
  2021-06-16  0:33             ` Bin Meng
  0 siblings, 1 reply; 26+ messages in thread
From: Emil Renner Berthing @ 2021-06-15 16:12 UTC (permalink / raw)
  To: Bin Meng
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> ...
> Yes, Gary Guo sent one patch long time ago against the broken assembly
> version, but that patch was still not applied as of today.
> https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
>
> I suggest Matteo re-test using Gary's version.

That's a good idea, but if you read the replies to Gary's original patch
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
.. both Gary, Palmer and David would rather like a C-based version.
This is one attempt at providing that.

> > I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> > avoid all these misaligned copies in the network stack.
> > Although avoiding 8n+4 aligned data is rather harder.
> >
> > Misaligned copies are just best avoided - really even on x86.
> > The 'real fun' is when the access crosses TLB boundaries.
>
> Regards,
> Bin


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 16:12           ` Emil Renner Berthing
@ 2021-06-16  0:33             ` Bin Meng
  2021-06-16  2:01               ` Matteo Croce
  0 siblings, 1 reply; 26+ messages in thread
From: Bin Meng @ 2021-06-16  0:33 UTC (permalink / raw)
  To: Emil Renner Berthing
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing <kernel@esmil.dk> wrote:
>
> On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > ...
> > Yes, Gary Guo sent one patch long time ago against the broken assembly
> > version, but that patch was still not applied as of today.
> > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> >
> > I suggest Matteo re-test using Gary's version.
>
> That's a good idea, but if you read the replies to Gary's original patch
> https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> .. both Gary, Palmer and David would rather like a C-based version.
> This is one attempt at providing that.

Yep, I prefer C as well :)

But if you check commit 04091d6, the assembly version was introduced
for KASAN. So if we are to change it back to C, please make sure KASAN
is not broken.

Regards,
Bin


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  0:33             ` Bin Meng
@ 2021-06-16  2:01               ` Matteo Croce
  2021-06-16  8:24                 ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Matteo Croce @ 2021-06-16  2:01 UTC (permalink / raw)
  To: Bin Meng
  Cc: Emil Renner Berthing, David Laight, Gary Guo, linux-riscv,
	linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, 16 Jun 2021 08:33:21 +0800
Bin Meng <bmeng.cn@gmail.com> wrote:

> On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing
> <kernel@esmil.dk> wrote:
> >
> > On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > > ...
> > > Yes, Gary Guo sent one patch long time ago against the broken
> > > assembly version, but that patch was still not applied as of
> > > today.
> > > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> > >
> > > I suggest Matteo re-test using Gary's version.
> >
> > That's a good idea, but if you read the replies to Gary's original
> > patch
> > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > .. both Gary, Palmer and David would rather like a C-based version.
> > This is one attempt at providing that.
> 
> Yep, I prefer C as well :)
> 
> But if you check commit 04091d6, the assembly version was introduced
> for KASAN. So if we are to change it back to C, please make sure KASAN
> is not broken.
> 
> Regards,
> Bin
> 

I added a small benchmark for memcpy() and memset() in lib/test_string.c:

memcpy_align_selftest():

#define PG_SIZE	(1 << (MAX_ORDER - 1 + PAGE_SHIFT))

	page1 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
	page2 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		for (j = 0; j < sizeof(void*); j++) {
			t0 = ktime_get();
			memcpy(dst + j, src + i, PG_SIZE - max(i, j));
			t1 = ktime_get();
			printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n",
				i, j, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
		}
	}

memset_align_selftest():

	page = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		t0 = ktime_get();
		memset(dst + i, 0, PG_SIZE - i);
		t1 = ktime_get();
		printk("Strings selftest: memset(dst+%d): %llu Mb/s\n", i,
			PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
	}

And I ran it agains the three implementations, current, Gary's assembler
and mine in C.

Current:
[   38.980687] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   39.021612] Strings selftest: memcpy(src+0, dst+1): 113 Mb/s
[   39.062191] Strings selftest: memcpy(src+0, dst+2): 114 Mb/s
[   39.102669] Strings selftest: memcpy(src+0, dst+3): 114 Mb/s
[   39.127423] Strings selftest: memcpy(src+0, dst+4): 209 Mb/s
[   39.167836] Strings selftest: memcpy(src+0, dst+5): 115 Mb/s
[   39.208305] Strings selftest: memcpy(src+0, dst+6): 114 Mb/s
[   39.248712] Strings selftest: memcpy(src+0, dst+7): 115 Mb/s
[   39.288144] Strings selftest: memcpy(src+1, dst+0): 118 Mb/s
[   39.309190] Strings selftest: memcpy(src+1, dst+1): 260 Mb/s
[   39.349721] Strings selftest: memcpy(src+1, dst+2): 114 Mb/s
[...]
[   41.289423] Strings selftest: memcpy(src+7, dst+5): 114 Mb/s
[   41.328801] Strings selftest: memcpy(src+7, dst+6): 118 Mb/s
[   41.349907] Strings selftest: memcpy(src+7, dst+7): 259 Mb/s

[   41.377735] Strings selftest: memset(dst+0): 241 Mb/s
[   41.397882] Strings selftest: memset(dst+1): 265 Mb/s
[   41.417666] Strings selftest: memset(dst+2): 272 Mb/s
[   41.437169] Strings selftest: memset(dst+3): 277 Mb/s
[   41.456656] Strings selftest: memset(dst+4): 277 Mb/s
[   41.476125] Strings selftest: memset(dst+5): 278 Mb/s
[   41.495555] Strings selftest: memset(dst+6): 278 Mb/s
[   41.515002] Strings selftest: memset(dst+7): 278 Mb/s

Gary's
[   27.438112] Strings selftest: memcpy(src+0, dst+0): 232 Mb/s
[   27.461586] Strings selftest: memcpy(src+0, dst+1): 224 Mb/s
[   27.484691] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   27.507693] Strings selftest: memcpy(src+0, dst+3): 230 Mb/s
[   27.530758] Strings selftest: memcpy(src+0, dst+4): 229 Mb/s
[   27.553840] Strings selftest: memcpy(src+0, dst+5): 229 Mb/s
[   27.576793] Strings selftest: memcpy(src+0, dst+6): 231 Mb/s
[   27.599862] Strings selftest: memcpy(src+0, dst+7): 230 Mb/s
[   27.622888] Strings selftest: memcpy(src+1, dst+0): 230 Mb/s
[   27.643964] Strings selftest: memcpy(src+1, dst+1): 259 Mb/s
[   27.666926] Strings selftest: memcpy(src+1, dst+2): 231 Mb/s
[...]
[   28.831726] Strings selftest: memcpy(src+7, dst+5): 230 Mb/s
[   28.854790] Strings selftest: memcpy(src+7, dst+6): 229 Mb/s
[   28.875844] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[   28.903666] Strings selftest: memset(dst+0): 240 Mb/s
[   28.923533] Strings selftest: memset(dst+1): 269 Mb/s
[   28.943100] Strings selftest: memset(dst+2): 275 Mb/s
[   28.962554] Strings selftest: memset(dst+3): 277 Mb/s
[   28.982009] Strings selftest: memset(dst+4): 277 Mb/s
[   29.001412] Strings selftest: memset(dst+5): 278 Mb/s
[   29.020894] Strings selftest: memset(dst+6): 277 Mb/s
[   29.040383] Strings selftest: memset(dst+7): 276 Mb/s

Mine:
[   33.916144] Strings selftest: memcpy(src+0, dst+0): 222 Mb/s
[   33.939520] Strings selftest: memcpy(src+0, dst+1): 226 Mb/s
[   33.962666] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   33.985749] Strings selftest: memcpy(src+0, dst+3): 229 Mb/s
[   34.008748] Strings selftest: memcpy(src+0, dst+4): 231 Mb/s
[   34.031970] Strings selftest: memcpy(src+0, dst+5): 228 Mb/s
[   34.055065] Strings selftest: memcpy(src+0, dst+6): 229 Mb/s
[   34.078068] Strings selftest: memcpy(src+0, dst+7): 231 Mb/s
[   34.101177] Strings selftest: memcpy(src+1, dst+0): 229 Mb/s
[   34.122995] Strings selftest: memcpy(src+1, dst+1): 247 Mb/s
[   34.146072] Strings selftest: memcpy(src+1, dst+2): 229 Mb/s
[...]
[   35.315594] Strings selftest: memcpy(src+7, dst+5): 229 Mb/s
[   35.338617] Strings selftest: memcpy(src+7, dst+6): 230 Mb/s
[   35.360464] Strings selftest: memcpy(src+7, dst+7): 247 Mb/s

[   35.388929] Strings selftest: memset(dst+0): 232 Mb/s
[   35.409351] Strings selftest: memset(dst+1): 260 Mb/s
[   35.429434] Strings selftest: memset(dst+2): 266 Mb/s
[   35.449460] Strings selftest: memset(dst+3): 267 Mb/s
[   35.469479] Strings selftest: memset(dst+4): 267 Mb/s
[   35.489481] Strings selftest: memset(dst+5): 268 Mb/s
[   35.509443] Strings selftest: memset(dst+6): 269 Mb/s
[   35.529449] Strings selftest: memset(dst+7): 268 Mb/s

Leaving out the first memcpy/memset of every test, which is always slower
(maybe because of a cache miss?), the current implementation copies at
260 Mb/s when the low order bits match, and at 114 Mb/s otherwise.
Memset is stable at 278 Mb/s.

Gary's implementation is much faster: it still copies at 260 Mb/s when
equally aligned, and at 230 Mb/s otherwise. Memset is the same as the
current one.

Mine has the same speed as Gary's when the low order bits mismatch, but
it's slower when equally aligned, where it stops at 247 Mb/s.
Memset is slightly slower, at 269 Mb/s.


I'm not familiar with RISC-V assembly, but looking at Gary's assembly I
think he manually unrolled the loop, copying 16 uint64_t at a time
using 16 registers.
I managed to do the same with a small change in the C code and a pragma directive:

This for memcpy():

	if (distance) {
		unsigned long last, next;
		int i;

		s.u8 -= distance;

		for (; count >= bytes_long * 8 + mask; count -= bytes_long * 8) {
			next = s.ulong[0];
			for (i = 0; i < 8; i++) {
				last = next;
				next = s.ulong[i + 1];

				d.ulong[i] = last >> (distance * 8) |
					next << ((bytes_long - distance) * 8);
			}

			d.ulong += 8;
			s.ulong += 8;
		}

		s.u8 += distance;
	} else {
		/* 8 byte wide copy */
		int i;
		for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
			for (i = 0; i < 8; i++)
				d.ulong[i] = s.ulong[i];
			d.ulong += 8;
			s.ulong += 8;
		}
	}

And this for memset:

		for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
			for (i = 0; i < 8; i++)
				dest.ulong[i] = cu;

			dest.ulong += 8;
		}

And the generated machine code is very, very similar to Gary's one!
And these are the result:

[   35.898366] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   35.920942] Strings selftest: memcpy(src+0, dst+1): 236 Mb/s
[   35.943171] Strings selftest: memcpy(src+0, dst+2): 241 Mb/s
[   35.965291] Strings selftest: memcpy(src+0, dst+3): 242 Mb/s
[   35.987374] Strings selftest: memcpy(src+0, dst+4): 244 Mb/s
[   36.009554] Strings selftest: memcpy(src+0, dst+5): 242 Mb/s
[   36.031721] Strings selftest: memcpy(src+0, dst+6): 242 Mb/s
[   36.053881] Strings selftest: memcpy(src+0, dst+7): 242 Mb/s
[   36.075949] Strings selftest: memcpy(src+1, dst+0): 243 Mb/s
[   36.097084] Strings selftest: memcpy(src+1, dst+1): 258 Mb/s
[   36.119269] Strings selftest: memcpy(src+1, dst+2): 242 Mb/s
[...]
[   37.242433] Strings selftest: memcpy(src+7, dst+5): 242 Mb/s
[   37.264571] Strings selftest: memcpy(src+7, dst+6): 242 Mb/s
[   37.285609] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[   37.313633] Strings selftest: memset(dst+0): 237 Mb/s
[   37.333682] Strings selftest: memset(dst+1): 266 Mb/s
[   37.353375] Strings selftest: memset(dst+2): 273 Mb/s
[   37.373000] Strings selftest: memset(dst+3): 274 Mb/s
[   37.392608] Strings selftest: memset(dst+4): 274 Mb/s
[   37.412220] Strings selftest: memset(dst+5): 274 Mb/s
[   37.431848] Strings selftest: memset(dst+6): 274 Mb/s
[   37.451467] Strings selftest: memset(dst+7): 274 Mb/s

This version is even faster than the assembly one, but it won't work for
copies/sets smaller than at least 64 bytes, or even 128. With small buffers
it will copy bytes one at a time, so I don't know if it's worth it.

What is preferred in your opinion: an implementation which is always fast
with all sizes, or one which is a bit faster on large buffers but slow with
small copies?

-- 
per aspera ad upstream


* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  2:01               ` Matteo Croce
@ 2021-06-16  8:24                 ` David Laight
  2021-06-16 10:48                   ` Akira Tsukamoto
  2021-06-16 19:06                   ` Matteo Croce
  0 siblings, 2 replies; 26+ messages in thread
From: David Laight @ 2021-06-16  8:24 UTC (permalink / raw)
  To: 'Matteo Croce', Bin Meng
  Cc: Emil Renner Berthing, Gary Guo, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

From: Matteo Croce
> Sent: 16 June 2021 03:02
...
> > > That's a good idea, but if you read the replies to Gary's original
> > > patch
> > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > .. both Gary, Palmer and David would rather like a C-based version.
> > > This is one attempt at providing that.
> >
> > Yep, I prefer C as well :)
> >
> > But if you check commit 04091d6, the assembly version was introduced
> > for KASAN. So if we are to change it back to C, please make sure KASAN
> > is not broken.
> >
...
> Leaving out the first memcpy/set of every test which is always slower, (maybe
> because of a cache miss?), the current implementation copies 260 Mb/s when
> the low order bits match, and 114 otherwise.
> Memset is stable at 278 Mb/s.
> 
> Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> and 230 Mb/s otherwise. Memset is the same as the current one.

Any idea what the attainable performance is for the cpu you are using?
Since both memset and memcpy are running at much the same speed
I suspect it is all limited by the writes.

272MB/s is only 34M writes/sec.
This seems horribly slow for a modern cpu.
So is this actually really limited by the cache writes to physical memory?

You might want to do some tests (userspace is fine) where you
check much smaller lengths that definitely sit within the data cache.

It is also worth checking how much overhead there is for
short copies - they are almost certainly more common than
you might expect.
This is one problem with excessive loop unrolling - the 'special
cases' for the ends of the buffer start having a big effect
on small copies.
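
(A minimal userspace sketch of that kind of test; the sizes and the
iteration count are arbitrary:)

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Time memcpy() for small, cache-resident sizes to expose the fixed
 * per-call overhead that large-copy benchmarks hide.
 */
char src[4096], dst[4096];

int main(void)
{
	const size_t sizes[] = { 8, 16, 32, 64, 128, 256 };
	struct timespec t0, t1;
	long j, iters = 10 * 1000 * 1000;
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (j = 0; j < iters; j++)
			memcpy(dst, src + (j & 7), sizes[i]);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%zu bytes: %.1f ns/copy\n", sizes[i],
		       ((t1.tv_sec - t0.tv_sec) * 1000000000.0 +
			(t1.tv_nsec - t0.tv_nsec)) / iters);
	}
	return 0;
}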

For cpus that support misaligned memory accesses, one 'trick'
for transfers longer than a 'word' is to do a (probably) misaligned
transfer of the last word of the buffer first followed by the
transfer of the rest of the buffer (overlapping a few bytes at the end).
This saves on conditionals and temporary values.
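
(Roughly like this sketch, assuming misaligned accesses are cheap and
count >= sizeof(long); the fixed-size memcpy() calls compile down to
single, possibly unaligned, loads and stores:)

#include <string.h>

void *tail_first_copy(void *dest, const void *src, size_t count)
{
	const char *s = src;
	char *d = dest;
	unsigned long w;
	size_t off;

	/* Write the (possibly misaligned) last word up front... */
	memcpy(&w, s + count - sizeof(w), sizeof(w));
	memcpy(d + count - sizeof(w), &w, sizeof(w));

	/* ...so the main loop can overlap it at the end without a
	 * separate byte-by-byte remainder.
	 */
	for (off = 0; off + sizeof(w) < count; off += sizeof(w))
		memcpy(d + off, s + off, sizeof(w));

	return dest;
}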

	David




* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  8:24                 ` David Laight
@ 2021-06-16 10:48                   ` Akira Tsukamoto
  2021-06-16 19:06                   ` Matteo Croce
  1 sibling, 0 replies; 26+ messages in thread
From: Akira Tsukamoto @ 2021-06-16 10:48 UTC (permalink / raw)
  To: David Laight
  Cc: Matteo Croce, Bin Meng, Emil Renner Berthing, Gary Guo,
	linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Drew Fustini

On Wed, Jun 16, 2021 at 5:24 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>
> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>
> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.

I am fine with Matteo's memcpy.

The two culprits seen in the `perf top -Ue task-clock` output during the
tcp and udp network tests are

> Overhead  Shared O  Symbol
>  42.22%  [kernel]  [k] memcpy
>  35.00%  [kernel]  [k] __asm_copy_to_user

so we really need to optimize both memcpy and __asm_copy_to_user.

The main reason for the speedup in memcpy is that

> Gary's assembly version of memcpy improves things by not doing unaligned
> accesses across a 64 bit boundary: it reads with aligned accesses at an
> offset and then shifts the data into place. Every misaligned access is
> trapped and switches to OpenSBI in M-mode, so the main speedup comes from
> avoiding the switching between S-mode (kernel) and M-mode (OpenSBI).

The corresponding parts in the code are:

Gary's:
+       /* Calculate shifts */
+       slli    t3, a3, 3
+       sub    t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+       /* Load the initial value and align a1 */
+       andi    a1, a1, ~(SZREG-1)
+       REG_L    a5, 0(a1)
+
+       addi    t0, t0, -(SZREG-1)
+       /* At least one iteration will be executed here, no check */
+1:
+       srl    a4, a5, t3
+       REG_L    a5, SZREG(a1)
+       addi    a1, a1, SZREG
+       sll    a2, a5, t4
+       or    a2, a2, a4
+       REG_S    a2, 0(a0)
+       addi    a0, a0, SZREG
+       bltu    a0, t0, 1b

and Matteo's port to C:

+#pragma GCC unroll 8
+        for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+            last = next;
+            next = s.ulong[1];
+
+            d.ulong[0] = last >> (distance * 8) |
+                     next << ((bytes_long - distance) * 8);
+
+            d.ulong++;
+            s.ulong++;
+        }

I believe this is reasonable and good enough to go upstream.

Akira


>
>         David
>
>


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
  2021-06-15  8:57   ` David Laight
@ 2021-06-16 11:46   ` Guo Ren
  2021-06-16 18:52     ` Matteo Croce
  1 sibling, 1 reply; 26+ messages in thread
From: Guo Ren @ 2021-06-16 11:46 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

Hi Matteo,

Have you tried Glibc generic implementation code?
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

If the glibc code has the same performance on your hardware, then you
could give a generic implementation first.

The current Linux generic implementation in lib/string.c is very simple:
#ifndef __HAVE_ARCH_MEMCPY
/**
 * memcpy - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * You should not use this function to access IO space, use memcpy_toio()
 * or memcpy_fromio() instead.
 */
void *memcpy(void *dest, const void *src, size_t count)
{
        char *tmp = dest;
        const char *s = src;

        while (count--)
                *tmp++ = *s++;
        return dest;
}
EXPORT_SYMBOL(memcpy);
#endif
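
For comparison, a word-at-a-time variant of the kind such a byte loop is
usually measured against would look roughly like this (just a sketch, not
glibc's or the kernel's actual code; memcpy_words is an illustrative name,
and it is only valid when dest, src and count are all long-aligned):

#include <stddef.h>

void *memcpy_words(void *dest, const void *src, size_t count)
{
        unsigned long *d = dest;
        const unsigned long *s = src;
        size_t i;

        /* copy one long per iteration instead of one byte */
        for (i = 0; i < count / sizeof(long); i++)
                d[i] = s[i];
        return dest;
}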

On Tue, Jun 15, 2021 at 10:42 AM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
>
> The procedure is made of three steps:
> First copy data one byte at a time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at a time, shifting the current and the next long
> to compose a long at every cycle.
> Finally, copy the remainder one byte at a time.
>
> On a BeagleV, the TCP RX throughput increased by 45%:
>
> before:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> [  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
> [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
> [  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
> [  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
> [  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
> [  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
> [  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
> [  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
> [  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver
>
> after:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
> [  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
> [  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
> [  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
> [  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
> [  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
> [  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
> [  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
> [  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
> [  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver
>
> And the decreased CPU time of the memcpy() is observable with perf top.
> This is the `perf top -Ue task-clock` output when doing the test:
>
> before:
>
> Overhead  Shared O  Symbol
>   42.22%  [kernel]  [k] memcpy
>   35.00%  [kernel]  [k] __asm_copy_to_user
>    3.50%  [kernel]  [k] sifive_l2_flush64_range
>    2.30%  [kernel]  [k] stmmac_napi_poll_rx
>    1.11%  [kernel]  [k] memset
>
> after:
>
> Overhead  Shared O  Symbol
>   45.69%  [kernel]  [k] __asm_copy_to_user
>   29.06%  [kernel]  [k] memcpy
>    4.09%  [kernel]  [k] sifive_l2_flush64_range
>    2.77%  [kernel]  [k] stmmac_napi_poll_rx
>    1.24%  [kernel]  [k] memset
>
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  arch/riscv/include/asm/string.h |   8 ++-
>  arch/riscv/kernel/riscv_ksyms.c |   2 -
>  arch/riscv/lib/Makefile         |   2 +-
>  arch/riscv/lib/memcpy.S         | 108 --------------------------------
>  arch/riscv/lib/string.c         |  94 +++++++++++++++++++++++++++
>  5 files changed, 101 insertions(+), 113 deletions(-)
>  delete mode 100644 arch/riscv/lib/memcpy.S
>  create mode 100644 arch/riscv/lib/string.c
>
> diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
> index 909049366555..6b5d6fc3eab4 100644
> --- a/arch/riscv/include/asm/string.h
> +++ b/arch/riscv/include/asm/string.h
> @@ -12,9 +12,13 @@
>  #define __HAVE_ARCH_MEMSET
>  extern asmlinkage void *memset(void *, int, size_t);
>  extern asmlinkage void *__memset(void *, int, size_t);
> +
> +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
>  #define __HAVE_ARCH_MEMCPY
> -extern asmlinkage void *memcpy(void *, const void *, size_t);
> -extern asmlinkage void *__memcpy(void *, const void *, size_t);
> +extern void *memcpy(void *dest, const void *src, size_t count);
> +extern void *__memcpy(void *dest, const void *src, size_t count);
> +#endif
> +
>  #define __HAVE_ARCH_MEMMOVE
>  extern asmlinkage void *memmove(void *, const void *, size_t);
>  extern asmlinkage void *__memmove(void *, const void *, size_t);
> diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
> index 5ab1c7e1a6ed..3f6d512a5b97 100644
> --- a/arch/riscv/kernel/riscv_ksyms.c
> +++ b/arch/riscv/kernel/riscv_ksyms.c
> @@ -10,8 +10,6 @@
>   * Assembly functions that may be used (directly or indirectly) by modules
>   */
>  EXPORT_SYMBOL(memset);
> -EXPORT_SYMBOL(memcpy);
>  EXPORT_SYMBOL(memmove);
>  EXPORT_SYMBOL(__memset);
> -EXPORT_SYMBOL(__memcpy);
>  EXPORT_SYMBOL(__memmove);
> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> index 25d5c9664e57..2ffe85d4baee 100644
> --- a/arch/riscv/lib/Makefile
> +++ b/arch/riscv/lib/Makefile
> @@ -1,9 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  lib-y                  += delay.o
> -lib-y                  += memcpy.o
>  lib-y                  += memset.o
>  lib-y                  += memmove.o
>  lib-$(CONFIG_MMU)      += uaccess.o
>  lib-$(CONFIG_64BIT)    += tishift.o
> +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
>
>  obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
> deleted file mode 100644
> index 51ab716253fa..000000000000
> --- a/arch/riscv/lib/memcpy.S
> +++ /dev/null
> @@ -1,108 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2013 Regents of the University of California
> - */
> -
> -#include <linux/linkage.h>
> -#include <asm/asm.h>
> -
> -/* void *memcpy(void *, const void *, size_t) */
> -ENTRY(__memcpy)
> -WEAK(memcpy)
> -       move t6, a0  /* Preserve return value */
> -
> -       /* Defer to byte-oriented copy for small sizes */
> -       sltiu a3, a2, 128
> -       bnez a3, 4f
> -       /* Use word-oriented copy only if low-order bits match */
> -       andi a3, t6, SZREG-1
> -       andi a4, a1, SZREG-1
> -       bne a3, a4, 4f
> -
> -       beqz a3, 2f  /* Skip if already aligned */
> -       /*
> -        * Round to nearest double word-aligned address
> -        * greater than or equal to start address
> -        */
> -       andi a3, a1, ~(SZREG-1)
> -       addi a3, a3, SZREG
> -       /* Handle initial misalignment */
> -       sub a4, a3, a1
> -1:
> -       lb a5, 0(a1)
> -       addi a1, a1, 1
> -       sb a5, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 1b
> -       sub a2, a2, a4  /* Update count */
> -
> -2:
> -       andi a4, a2, ~((16*SZREG)-1)
> -       beqz a4, 4f
> -       add a3, a1, a4
> -3:
> -       REG_L a4,       0(a1)
> -       REG_L a5,   SZREG(a1)
> -       REG_L a6, 2*SZREG(a1)
> -       REG_L a7, 3*SZREG(a1)
> -       REG_L t0, 4*SZREG(a1)
> -       REG_L t1, 5*SZREG(a1)
> -       REG_L t2, 6*SZREG(a1)
> -       REG_L t3, 7*SZREG(a1)
> -       REG_L t4, 8*SZREG(a1)
> -       REG_L t5, 9*SZREG(a1)
> -       REG_S a4,       0(t6)
> -       REG_S a5,   SZREG(t6)
> -       REG_S a6, 2*SZREG(t6)
> -       REG_S a7, 3*SZREG(t6)
> -       REG_S t0, 4*SZREG(t6)
> -       REG_S t1, 5*SZREG(t6)
> -       REG_S t2, 6*SZREG(t6)
> -       REG_S t3, 7*SZREG(t6)
> -       REG_S t4, 8*SZREG(t6)
> -       REG_S t5, 9*SZREG(t6)
> -       REG_L a4, 10*SZREG(a1)
> -       REG_L a5, 11*SZREG(a1)
> -       REG_L a6, 12*SZREG(a1)
> -       REG_L a7, 13*SZREG(a1)
> -       REG_L t0, 14*SZREG(a1)
> -       REG_L t1, 15*SZREG(a1)
> -       addi a1, a1, 16*SZREG
> -       REG_S a4, 10*SZREG(t6)
> -       REG_S a5, 11*SZREG(t6)
> -       REG_S a6, 12*SZREG(t6)
> -       REG_S a7, 13*SZREG(t6)
> -       REG_S t0, 14*SZREG(t6)
> -       REG_S t1, 15*SZREG(t6)
> -       addi t6, t6, 16*SZREG
> -       bltu a1, a3, 3b
> -       andi a2, a2, (16*SZREG)-1  /* Update count */
> -
> -4:
> -       /* Handle trailing misalignment */
> -       beqz a2, 6f
> -       add a3, a1, a2
> -
> -       /* Use word-oriented copy if co-aligned to word boundary */
> -       or a5, a1, t6
> -       or a5, a5, a3
> -       andi a5, a5, 3
> -       bnez a5, 5f
> -7:
> -       lw a4, 0(a1)
> -       addi a1, a1, 4
> -       sw a4, 0(t6)
> -       addi t6, t6, 4
> -       bltu a1, a3, 7b
> -
> -       ret
> -
> -5:
> -       lb a4, 0(a1)
> -       addi a1, a1, 1
> -       sb a4, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 5b
> -6:
> -       ret
> -END(__memcpy)
> diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> new file mode 100644
> index 000000000000..525f9ee25a74
> --- /dev/null
> +++ b/arch/riscv/lib/string.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * String functions optimized for hardware which doesn't
> + * handle unaligned memory accesses efficiently.
> + *
> + * Copyright (C) 2021 Matteo Croce
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +
> +/* size below which a classic byte-at-a-time copy is done */
> +#define MIN_THRESHOLD 64
> +
> +/* convenience types to avoid cast between different pointer types */
> +union types {
> +       u8 *u8;
> +       unsigned long *ulong;
> +       uintptr_t uptr;
> +};
> +
> +union const_types {
> +       const u8 *u8;
> +       unsigned long *ulong;
> +};
> +
> +void *memcpy(void *dest, const void *src, size_t count)
> +{
> +       const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +       const int mask = bytes_long - 1;
> +       const int distance = (src - dest) & mask;
> +#endif
> +       union const_types s = { .u8 = src };
> +       union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +       if (count <= MIN_THRESHOLD)
> +               goto copy_remainder;
> +
> +       /* copy a byte at a time until destination is aligned */
> +       for (; count && d.uptr & mask; count--)
> +               *d.u8++ = *s.u8++;
> +
> +       if (distance) {
> +               unsigned long last, next;
> +
> +               /* move s backward to the previous alignment boundary */
> +               s.u8 -= distance;
> +
> +               /* 32/64 bit wide copy from s to d.
> +                * d is aligned now but s is not, so read s alignment wise,
> +                * and do proper shift to get the right value.
> +                * Works only on Little Endian machines.
> +                */
> +               for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> +                       last = next;
> +                       next = s.ulong[1];
> +
> +                       d.ulong[0] = last >> (distance * 8) |
> +                                    next << ((bytes_long - distance) * 8);
> +
> +                       d.ulong++;
> +                       s.ulong++;
> +               }
> +
> +               /* restore s with the original offset */
> +               s.u8 += distance;
> +       } else
> +#endif
> +       {
> +               /* if the source and dest lower bits are the same, do a simple
> +                * 32/64 bit wide copy.
> +                */
> +               for (; count >= bytes_long; count -= bytes_long)
> +                       *d.ulong++ = *s.ulong++;
> +       }
> +
> +       /* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
> +       goto copy_remainder;
> +
> +copy_remainder:
> +       while (count--)
> +               *d.u8++ = *s.u8++;
> +
> +       return dest;
> +}
> +EXPORT_SYMBOL(memcpy);
> +
> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +       return memcpy(dest, src, count);
> +}
> +EXPORT_SYMBOL(__memcpy);
> --
> 2.31.1
>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16 11:46   ` Guo Ren
@ 2021-06-16 18:52     ` Matteo Croce
  2021-06-17 21:30       ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Matteo Croce @ 2021-06-16 18:52 UTC (permalink / raw)
  To: Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
>
> Hi Matteo,
>
> Have you tried Glibc generic implementation code?
> ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
>
> If Glibc codes have the same performance in your hardware, then you
> could give a generic implementation first.
>

Hi,

I had a look, it seems that it's a C unrolled version with the
'register' keyword.
The same one was already merged in nios2:
https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I copied _wordcopy_fwd_aligned() from Glibc, and I get a result very similar
to the other versions:

[  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  8:24                 ` David Laight
  2021-06-16 10:48                   ` Akira Tsukamoto
@ 2021-06-16 19:06                   ` Matteo Croce
  1 sibling, 0 replies; 26+ messages in thread
From: Matteo Croce @ 2021-06-16 19:06 UTC (permalink / raw)
  To: David Laight
  Cc: Bin Meng, Emil Renner Berthing, Gary Guo, linux-riscv,
	linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, Jun 16, 2021 at 10:24 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>

I get similar results in userspace; this tool writes to RAM with a
variable data width:

root@beaglev:~/src# ./unalign_check 1 0 1
size:           1 Mb
write size:      8 bit
unalignment:    0 byte
elapsed time:   0.01 sec
throughput:     124.36 Mb/s

# ./unalign_check 1 0 8
size:           1 Mb
write size:      64 bit
unalignment:    0 byte
elapsed time:   0.00 sec
throughput:     252.12 Mb/s
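
For reference, a rough userspace sketch of that kind of test (my own
illustration, not the actual unalign_check tool): write a buffer with a chosen
access width and offset and report the throughput.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

int main(int argc, char **argv)
{
        size_t size = (argc > 1 ? (size_t)atoi(argv[1]) : 1) * 1024 * 1024;
        size_t offset = argc > 2 ? (size_t)atoi(argv[2]) : 0;
        int width = argc > 3 ? atoi(argv[3]) : 8;       /* bytes per store */
        unsigned char *buf = malloc(size + 16);
        struct timespec t0, t1;
        double secs;
        size_t i;

        if (!buf)
                return 1;
        memset(buf, 0, size + 16);              /* fault the pages in */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (width == 8) {
                /* note: a nonzero offset makes these stores misaligned */
                uint64_t *p = (uint64_t *)(buf + offset);
                for (i = 0; i < size / 8; i++)
                        p[i] = i;
        } else {
                for (i = 0; i < size; i++)
                        buf[offset + i] = (unsigned char)i;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("throughput: %.2f MB/s\n", size / secs / 1e6);
        free(buf);
        return 0;
}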

> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>

I too believe that they are much more common than long ones.
Indeed, I wish to reduce the MIN_THRESHOLD value from 64 to 32 or even 16,
or have it depend on the word size, e.g. sizeof(long) * 2.

Suggestions?
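
Spelled out, that last option would just be (sketch only, not a decision):

#define MIN_THRESHOLD   (2 * sizeof(unsigned long))     /* 16 on RV64, 8 on RV32 */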

> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16 18:52     ` Matteo Croce
@ 2021-06-17 21:30       ` David Laight
  2021-06-17 21:48         ` Matteo Croce
  0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2021-06-17 21:30 UTC (permalink / raw)
  To: 'Matteo Croce', Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 16 June 2021 19:52
> To: Guo Ren <guoren@kernel.org>
> 
> On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> >
> > Hi Matteo,
> >
> > Have you tried Glibc generic implementation code?
> > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> >
> > If Glibc codes have the same performance in your hardware, then you
> > could give a generic implementation first.

Isn't that a byte copy loop - the performance of that ought to be terrible.
...

> I had a look, it seems that it's a C unrolled version with the
> 'register' keyword.
> The same one was already merged in nios2:
> https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I know a lot about the nios2 instruction timings.
(I've looked at code execution in the FPGA's Intel 'logic analyser'.)
It is a very simple 4-clock pipeline cpu with a 2-clock delay
before a value read from 'tightly coupled memory' (aka cache)
can be used in another instruction.
There is also a subtle pipeline stall if a read follows a write
to the same memory block because the write is executed one
clock later - and would collide with the read.
Since it only ever executes one instruction per clock, loop
unrolling does help - since you never get the loop control 'for free'.
OTOH you don't need to use that many registers.
But an unrolled loop should approach 2 bytes/clock (32bit cpu).

> I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> result of the other versions:
> 
> [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

What clock speed is that running at?
It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).

While the small riscv cpu might be similar to the nios2 (and mips
for that matter), there are also bigger/faster cpus.
I'm sure these can execute multiple instructions/clock
and possibly even read and write at the same time.
Unless they also support significant instruction re-ordering,
the trivial copy loops are going to be slow on such cpus.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-17 21:30       ` David Laight
@ 2021-06-17 21:48         ` Matteo Croce
  2021-06-18  0:32           ` Matteo Croce
  0 siblings, 1 reply; 26+ messages in thread
From: Matteo Croce @ 2021-06-17 21:48 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 19:52
> > To: Guo Ren <guoren@kernel.org>
> >
> > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > >
> > > Hi Matteo,
> > >
> > > Have you tried Glibc generic implementation code?
> > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > >
> > > If Glibc codes have the same performance in your hardware, then you
> > > could give a generic implementation first.
>
> Isn't that a byte copy loop - the performance of that ought to be terrible.
> ...
>
> > I had a look, it seems that it's a C unrolled version with the
> > 'register' keyword.
> > The same one was already merged in nios2:
> > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
>
> I know a lot about the nios2 instruction timings.
> (I've looked at code execution in the fpga's intel 'logic analiser.)
> It is a very simple 4-clock pipeline cpu with a 2-clock delay
> before a value read from 'tightly coupled memory' (aka cache)
> can be used in another instruction.
> There is also a subtle pipeline stall if a read follows a write
> to the same memory block because the write is executed one
> clock later - and would collide with the read.
> Since it only ever executes one instruction per clock loop
> unrolling does help - since you never get the loop control 'for free'.
> OTOH you don't need to use that many registers.
> But an unrolled loop should approach 2 bytes/clock (32bit cpu).
>
> > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > result of the other versions:
> >
> > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
>
> What clock speed is that running at?
> It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
>
> While the small riscv cpu might be similar to the nios2 (and mips
> for that matter), there are also bigger/faster cpu.
> I'm sure these can execute multiple instructions/clock
> and possible even read and write at the same time.
> Unless they also support significant instruction re-ordering
> the trivial copy loops are going to be slow on such cpu.
>

It's running at 1 GHz.

I get 257 Mb/s with a memcpy, a bit more with a memset,
but I get 1200 Mb/s with a loop that just reads memory with 64-bit accesses.

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-17 21:48         ` Matteo Croce
@ 2021-06-18  0:32           ` Matteo Croce
  2021-06-18  1:05             ` Matteo Croce
  0 siblings, 1 reply; 26+ messages in thread
From: Matteo Croce @ 2021-06-18  0:32 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Matteo Croce
> > > Sent: 16 June 2021 19:52
> > > To: Guo Ren <guoren@kernel.org>
> > >
> > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > > >
> > > > Hi Matteo,
> > > >
> > > > Have you tried Glibc generic implementation code?
> > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > > >
> > > > If Glibc codes have the same performance in your hardware, then you
> > > > could give a generic implementation first.
> >
> > Isn't that a byte copy loop - the performance of that ought to be terrible.
> > ...
> >
> > > I had a look, it seems that it's a C unrolled version with the
> > > 'register' keyword.
> > > The same one was already merged in nios2:
> > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
> >
> > I know a lot about the nios2 instruction timings.
> > (I've looked at code execution in the fpga's intel 'logic analiser.)
> > It is a very simple 4-clock pipeline cpu with a 2-clock delay
> > before a value read from 'tightly coupled memory' (aka cache)
> > can be used in another instruction.
> > There is also a subtle pipeline stall if a read follows a write
> > to the same memory block because the write is executed one
> > clock later - and would collide with the read.
> > Since it only ever executes one instruction per clock loop
> > unrolling does help - since you never get the loop control 'for free'.
> > OTOH you don't need to use that many registers.
> > But an unrolled loop should approach 2 bytes/clock (32bit cpu).
> >
> > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > > result of the other versions:
> > >
> > > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
> >
> > What clock speed is that running at?
> > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
> >
> > While the small riscv cpu might be similar to the nios2 (and mips
> > for that matter), there are also bigger/faster cpu.
> > I'm sure these can execute multiple instructions/clock
> > and possible even read and write at the same time.
> > Unless they also support significant instruction re-ordering
> > the trivial copy loops are going to be slow on such cpu.
> >
>
> It's running at 1 GHz.
>
> I get 257 Mb/s with a memcpy, a bit more with a memset,
> but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
>

Err, I forgot to mlock() the memory before accessing it in userspace.

The real speed here is:

8 bit read: 155.42 Mb/s
64 bit read: 277.29 Mb/s
8 bit write: 138.57 Mb/s
64 bit write: 239.21 Mb/s

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-18  0:32           ` Matteo Croce
@ 2021-06-18  1:05             ` Matteo Croce
  2021-06-18  8:32               ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Matteo Croce @ 2021-06-18  1:05 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Fri, Jun 18, 2021 at 2:32 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
> <mcroce@linux.microsoft.com> wrote:
> >
> > On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Matteo Croce
> > > > Sent: 16 June 2021 19:52
> > > > To: Guo Ren <guoren@kernel.org>
> > > >
> > > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > > > >
> > > > > Hi Matteo,
> > > > >
> > > > > Have you tried Glibc generic implementation code?
> > > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > > > >
> > > > > If Glibc codes have the same performance in your hardware, then you
> > > > > could give a generic implementation first.
> > >
> > > Isn't that a byte copy loop - the performance of that ought to be terrible.
> > > ...
> > >
> > > > I had a look, it seems that it's a C unrolled version with the
> > > > 'register' keyword.
> > > > The same one was already merged in nios2:
> > > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
> > >
> > > I know a lot about the nios2 instruction timings.
> > > (I've looked at code execution in the fpga's intel 'logic analiser.)
> > > It is a very simple 4-clock pipeline cpu with a 2-clock delay
> > > before a value read from 'tightly coupled memory' (aka cache)
> > > can be used in another instruction.
> > > There is also a subtle pipeline stall if a read follows a write
> > > to the same memory block because the write is executed one
> > > clock later - and would collide with the read.
> > > Since it only ever executes one instruction per clock loop
> > > unrolling does help - since you never get the loop control 'for free'.
> > > OTOH you don't need to use that many registers.
> > > But an unrolled loop should approach 2 bytes/clock (32bit cpu).
> > >
> > > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > > > result of the other versions:
> > > >
> > > > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
> > >
> > > What clock speed is that running at?
> > > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
> > >
> > > While the small riscv cpu might be similar to the nios2 (and mips
> > > for that matter), there are also bigger/faster cpu.
> > > I'm sure these can execute multiple instructions/clock
> > > and possible even read and write at the same time.
> > > Unless they also support significant instruction re-ordering
> > > the trivial copy loops are going to be slow on such cpu.
> > >
> >
> > It's running at 1 GHz.
> >
> > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
> >
>
> Err, I forget a mlock() before accessing the memory in userspace.
>
> The real speed here is:
>
> 8 bit read: 155.42 Mb/s
> 64 bit read: 277.29 Mb/s
> 8 bit write: 138.57 Mb/s
> 64 bit write: 239.21 Mb/s
>

Anyway, thanks for the info on the nios2 timings.
If you think that an unrolled loop would help, we can achieve the same in C.
I think we could code something similar to a Duff's device (or use jump
labels) to unroll the loop while still doing efficient small copies.
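
For illustration, a Duff's-device-style byte copier unrolled by 8 might look
like this (my own untested sketch, memcpy_duff is just an illustrative name,
and it is not a proposal for the patch):

#include <stddef.h>

static void *memcpy_duff(void *dest, const void *src, size_t count)
{
        unsigned char *d = dest;
        const unsigned char *s = src;
        size_t n = (count + 7) / 8;     /* number of passes through the switch */

        if (!count)
                return dest;

        switch (count % 8) {
        case 0: do { *d++ = *s++;
        case 7:      *d++ = *s++;
        case 6:      *d++ = *s++;
        case 5:      *d++ = *s++;
        case 4:      *d++ = *s++;
        case 3:      *d++ = *s++;
        case 2:      *d++ = *s++;
        case 1:      *d++ = *s++;
                } while (--n > 0);
        }
        return dest;
}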

Regards,

--
per aspera ad upstream

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-18  1:05             ` Matteo Croce
@ 2021-06-18  8:32               ` David Laight
  0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2021-06-18  8:32 UTC (permalink / raw)
  To: 'Matteo Croce'
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 18 June 2021 02:05
...
> > > It's running at 1 GHz.
> > >
> > > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > > but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
> > >
> >
> > Err, I forget a mlock() before accessing the memory in userspace.

What is the mlock() for?
The data for a quick loop won't get paged out.
You want to test cache to cache copies, so the first loop
will always be slow.
After that each iteration should be much the same.
I use code like:
	for (;;) {
		start = read_tsc();
		do_test();
		histogram[(read_tsc() - start) >> n]++
	}
(You need to exclude outliers)
to get a distribution for the execution times.
Tends to be pretty stable - even though different program
runs can give different values!
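
Fleshed out a little, that measurement loop might look like the sketch below
(my own rough version; read_ns() is a clock_gettime() stand-in for read_tsc(),
and memcpy of a 4K buffer stands in for do_test()):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

#define BUCKETS 256
#define SHIFT   4                       /* bucket granularity: 16 ns */

static uint64_t read_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        static unsigned char src[4096], dst[4096];
        static unsigned long histogram[BUCKETS];
        int i;

        for (i = 0; i < 100000; i++) {
                uint64_t start = read_ns();

                memcpy(dst, src, sizeof(src));          /* do_test() */
                uint64_t bucket = (read_ns() - start) >> SHIFT;
                histogram[bucket < BUCKETS ? bucket : BUCKETS - 1]++;
        }
        for (i = 0; i < BUCKETS; i++)
                if (histogram[i])
                        printf("%5d ns: %lu\n", i << SHIFT, histogram[i]);
        return 0;
}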
	
> > The real speed here is:
> >
> > 8 bit read: 155.42 Mb/s
> > 64 bit read: 277.29 Mb/s
> > 8 bit write: 138.57 Mb/s
> > 64 bit write: 239.21 Mb/s
> >
> 
> Anyway, thanks for the info on nio2 timings.
> If you think that an unrolled loop would help, we can achieve the same in C.
> I think we could code something similar to a Duff device (or with jump
> labels) to unroll the loop but at the same time doing efficient small copies.

Unrolling has to be done with care.
It tends to improve benchmarks, but the extra code displaces
other code from the i-cache and slows down overall performance.
So you need 'just enough' unrolling to avoid cpu stalls.

On your system it looks like the memory/cache subsystem
is the bottleneck for the tests you are doing.
I'd really expect a 1GHz cpu to be able to read/write from
its data cache every clock.
So I'd expect transfer rates nearer 8000 MB/s, not 250 MB/s.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
  2024-01-28 12:35   ` David Laight
@ 2024-01-30 12:11   ` Nick Kossifidis
  1 sibling, 0 replies; 26+ messages in thread
From: Nick Kossifidis @ 2024-01-30 12:11 UTC (permalink / raw)
  To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

On 1/28/24 13:10, Jisheng Zhang wrote:
> +
> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	union const_types s = { .as_u8 = src };
> +	union types d = { .as_u8 = dest };
> +	int distance = 0;
> +
> +	if (count < MIN_THRESHOLD)
> +		goto copy_remainder;
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> +		/* Copy a byte at time until destination is aligned. */
> +		for (; d.as_uptr & WORD_MASK; count--)
> +			*d.as_u8++ = *s.as_u8++;
> +
> +		distance = s.as_uptr & WORD_MASK;
> +	}
> +
> +	if (distance) {
> +		unsigned long last, next;
> +
> +		/*
> +		 * s is distance bytes ahead of d, and d just reached
> +		 * the alignment boundary. Move s backward to word align it
> +		 * and shift data to compensate for distance, in order to do
> +		 * word-by-word copy.
> +		 */
> +		s.as_u8 -= distance;
> +
> +		next = s.as_ulong[0];
> +		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
> +			last = next;
> +			next = s.as_ulong[1];
> +
> +			d.as_ulong[0] = last >> (distance * 8) |
> +					next << ((BYTES_LONG - distance) * 8);
> +
> +			d.as_ulong++;
> +			s.as_ulong++;
> +		}
> +
> +		/* Restore s with the original offset. */
> +		s.as_u8 += distance;
> +	} else {
> +		/*
> +		 * If the source and dest lower bits are the same, do a simple
> +		 * aligned copy.
> +		 */
> +		size_t aligned_count = count & ~(BYTES_LONG * 8 - 1);
> +
> +		__memcpy_aligned(d.as_ulong, s.as_ulong, aligned_count);
> +		d.as_u8 += aligned_count;
> +		s.as_u8 += aligned_count;
> +		count &= BYTES_LONG * 8 - 1;
> +	}
> +
> +copy_remainder:
> +	while (count--)
> +		*d.as_u8++ = *s.as_u8++;
> +
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memcpy);
> +

We could also implement memcmp this way, e.g.:

int
memcmp(const void *s1, const void *s2, size_t len)
{
	union const_data a = { .as_bytes = s1 };
	union const_data b = { .as_bytes = s2 };
	unsigned long a_val = 0;
	unsigned long b_val = 0;
	size_t remaining = len;
	size_t a_offt = 0;

	/* Nothing to do */
	if (!s1 || !s2 || s1 == s2 || !len)
		return 0;

	if (len < 2 * WORD_SIZE)
		goto trailing_fw;

	for(; b.as_uptr & WORD_MASK; remaining--) {
		a_val = *a.as_bytes++;
		b_val = *b.as_bytes++;
		if (a_val != b_val)
			goto done;
	}

	a_offt = a.as_uptr & WORD_MASK;
	if (!a_offt) {
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
			a_val = *a.as_ulong++;
			b_val = *b.as_ulong++;
			if (a_val != b_val)
				break;

		}
	} else {
		unsigned long a_cur, a_next;
		a.as_bytes -= a_offt;
		a_next = *a.as_ulong;
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE, b.as_ulong++) {
			a_cur = a_next;
			a_next = *++a.as_ulong;
			a_val = a_cur >> (a_offt * 8) |
				a_next << ((WORD_SIZE - a_offt) * 8);
			b_val = *b.as_ulong;
			if (a_val != b_val) {
				a.as_bytes += a_offt;
				break;
			}
		}
		a.as_bytes += a_offt;
	}

  trailing_fw:
	while (remaining-- > 0) {
		a_val = *a.as_bytes++;
		b_val = *b.as_bytes++;
		if (a_val != b_val)
			break;
	}

  done:
	if (!remaining)
		return 0;

	return (int) (a_val - b_val);
}

Regards,
Nick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
@ 2024-01-28 12:35   ` David Laight
  2024-01-30 12:11   ` Nick Kossifidis
  1 sibling, 0 replies; 26+ messages in thread
From: David Laight @ 2024-01-28 12:35 UTC (permalink / raw)
  To: 'Jisheng Zhang', Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 28 January 2024 11:10
> 
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
> 
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> 
> On a BeagleV, the TCP RX throughput increased by 45%:
...
> +static void __memcpy_aligned(unsigned long *dest, const unsigned long *src, size_t count)
> +{

You should be able to remove an instruction from the loop by using:
	const unsigned long *src_lim = src + count;
	for (; src < src_lim; ) {

> +	for (; count > 0; count -= BYTES_LONG * 8) {
> +		register unsigned long d0, d1, d2, d3, d4, d5, d6, d7;

register is completely ignored and pointless.
(More annoyingly auto is also ignored.)

> +		d0 = src[0];
> +		d1 = src[1];
> +		d2 = src[2];
> +		d3 = src[3];
> +		d4 = src[4];
> +		d5 = src[5];
> +		d6 = src[6];
> +		d7 = src[7];
> +		dest[0] = d0;
> +		dest[1] = d1;
> +		dest[2] = d2;
> +		dest[3] = d3;
> +		dest[4] = d4;
> +		dest[5] = d5;
> +		dest[6] = d6;
> +		dest[7] = d7;
> +		dest += 8;
> +		src += 8;

These two lines belong in the for (...) statement.

> +	}
> +}

If you __always_inline the function you can pass &src and &dest
and use the updated pointers following the loop.
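
Putting those two suggestions together, my reading of it would be something
like the kernel-style sketch below (untested, and assuming count is a multiple
of 8 * sizeof(long), as in the aligned path of the patch):

static __always_inline void __memcpy_aligned(unsigned long **destp,
                                             const unsigned long **srcp,
                                             size_t count)
{
        unsigned long *dest = *destp;
        const unsigned long *src = *srcp;
        const unsigned long *src_lim = src + count / sizeof(long);

        /* bounding on src_lim removes the separate count bookkeeping */
        for (; src < src_lim; src += 8, dest += 8) {
                unsigned long d0 = src[0], d1 = src[1], d2 = src[2], d3 = src[3];
                unsigned long d4 = src[4], d5 = src[5], d6 = src[6], d7 = src[7];

                dest[0] = d0; dest[1] = d1; dest[2] = d2; dest[3] = d3;
                dest[4] = d4; dest[5] = d5; dest[6] = d6; dest[7] = d7;
        }

        /* callers pick up the advanced pointers */
        *destp = dest;
        *srcp = src;
}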

I don't believe that risc-v supports 'reg+reg+(imm5<<3)' addressing
(although there is probably space in the instruction for it).
Actually 'reg+reg' addressing could be supported for loads but
not stores - since the latter would require 3 registers to be read.

We use the Nios-II cpu in some fpgas. Intel are removing support
in favour of Risc-V - we are thinking of re-implementing Nios-II
ourselves!
I don't think they understand what the cpu gets used for!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
@ 2024-01-28 11:10 ` Jisheng Zhang
  2024-01-28 12:35   ` David Laight
  2024-01-30 12:11   ` Nick Kossifidis
  0 siblings, 2 replies; 26+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next long
to compose a long at every cycle.
Finally, copy the remainder one byte at a time.

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

And the decreased CPU time of the memcpy() is observable with perf top.
This is the `perf top -Ue task-clock` output when doing the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Compared with Matteo's original series, Jisheng made the below changes:
1. adopt Emil's change to fix the boot failure when built with clang
2. add the corresponding changes to purgatory
3. always build the optimized string.c rather than only building it when
optimizing for performance
4. implement unroll support when src & dst are both aligned, to keep
the same performance as the assembly version. After disassembling, I found
that the unrolled version looks something like below, so it achieves
the same "unroll" effect as the asm version but in the C programming language:
	ld	t2,0(a5)
	ld	t0,8(a5)
	ld	t6,16(a5)
	ld	t5,24(a5)
	ld	t4,32(a5)
	ld	t3,40(a5)
	ld	t1,48(a5)
	ld	a1,56(a5)
	sd	t2,0(a6)
	sd	t0,8(a6)
	sd	t6,16(a6)
	sd	t5,24(a6)
	sd	t4,32(a6)
	sd	t3,40(a6)
	sd	t1,48(a6)
	sd	a1,56(a6)
And per my testing, unrolling more doesn't help performance, so
the "c" version only unrolls by using 8 GP regs rather than the 16
the asm version uses.
5. Add proper __pi_memcpy and __pi___memcpy aliases
6. more performance numbers.

Jisheng's commit msg:
Using the benchmark program from [1], I got the results below on the TH1520,
CV1800B and JH7110 platforms.

*TH1520 platform (I fixed cpu freq at 750MHZ):

Before the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.52 64K: 0.43 128K: 0.38 256K: 0.35 512K: 0.31 1024K: 0.22 avg 0.34
memcpy_call 32K: 0.41 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.28 1024K: 0.20 avg 0.30

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.46 16B: 0.61 32B: 0.84 64B: 0.89 128B: 3.31 256B: 3.44 512B: 3.51
memcpy_call 8B: 0.18 16B: 0.26 32B: 0.50 64B: 0.90 128B: 1.57 256B: 2.31 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.19 16B: 0.18 32B: 0.25 64B: 0.30 128B: 0.33 256B: 0.35 512B: 0.36
memcpy_call 8B: 0.16 16B: 0.22 32B: 0.39 64B: 0.70 128B: 1.11 256B: 1.46 512B: 1.81

Large memcpy (bytes/ns):
   __memcpy 1K: 3.57 2K: 3.85 4K: 3.75 8K: 3.98 16K: 4.03 32K: 4.06 64K: 4.40
memcpy_call 1K: 3.13 2K: 3.75 4K: 3.99 8K: 4.29 16K: 4.40 32K: 4.46 64K: 4.63

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.32 64K: 0.28 128K: 0.26 256K: 0.24 512K: 0.22 1024K: 0.17 avg 0.24
memcpy_call 32K: 0.39 64K: 0.34 128K: 0.32 256K: 0.30 512K: 0.27 1024K: 0.20 avg 0.29

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.20 16B: 0.22 32B: 0.25 64B: 2.43 128B: 3.19 256B: 3.36 512B: 3.55
memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.30 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.22 16B: 0.29 32B: 0.49 64B: 0.51 128B: 0.87 256B: 1.08 512B: 1.27
memcpy_call 8B: 0.12 16B: 0.21 32B: 0.40 64B: 0.70 128B: 1.10 256B: 1.46 512B: 1.80

Large memcpy (bytes/ns):
   __memcpy 1K: 3.63 2K: 3.66 4K: 3.78 8K: 3.87 16K: 3.96 32K: 4.11 64K: 4.40
memcpy_call 1K: 3.32 2K: 3.68 4K: 3.99 8K: 4.17 16K: 4.25 32K: 4.48 64K: 4.60

As can be seen, the unaligned medium memcpy performance is improved by
about 252%, i.e. it reaches 3.5x the speed of the original. The performance
of the other memcpy styles is kept the same as the original's.

And since the TH1520 supports HAVE_EFFICIENT_UNALIGNED_ACCESS, we can
optimize the memcpy further without taking care of alignment at all.
Random memcpy (bytes/ns):
   __memcpy 32K: 0.35 64K: 0.31 128K: 0.28 256K: 0.25 512K: 0.23 1024K: 0.17 avg 0.25
memcpy_call 32K: 0.40 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.27 1024K: 0.20 avg 0.30

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.21 16B: 0.23 32B: 0.27 64B: 3.34 128B: 3.42 256B: 3.50 512B: 3.58
memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.31 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.20 16B: 0.23 32B: 0.28 64B: 3.05 128B: 2.70 256B: 2.82 512B: 2.88
memcpy_call 8B: 0.16 16B: 0.21 32B: 0.38 64B: 0.70 128B: 1.11 256B: 1.50 512B: 1.81

Large memcpy (bytes/ns):
   __memcpy 1K: 3.62 2K: 3.71 4K: 3.76 8K: 3.92 16K: 3.96 32K: 4.12 64K: 4.40
memcpy_call 1K: 3.11 2K: 3.66 4K: 4.02 8K: 4.16 16K: 4.34 32K: 4.47 64K: 4.62

As can be seen, the unaligned medium memcpy is improved by 700%, i.e. it runs
at 8x the speed of the original.

*CV1800B platform:

Before the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.21 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.26 16B: 0.36 32B: 0.48 64B: 0.51 128B: 2.01 256B: 2.44 512B: 2.73
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.12 32B: 0.15 64B: 0.16 128B: 0.16 256B: 0.17 512B: 0.17
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.90 2K: 2.91 4K: 3.00 8K: 3.04 16K: 3.03 32K: 2.89 64K: 2.52
memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.15 64K: 0.08 128K: 0.06 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.11 32B: 0.14 64B: 1.15 128B: 1.62 256B: 2.06 512B: 2.40
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.12 32B: 0.21 64B: 0.32 128B: 0.43 256B: 0.52 512B: 0.59
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.56 2K: 2.71 4K: 2.78 8K: 2.81 16K: 2.80 32K: 2.68 64K: 2.51
memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54

We get a similar performance improvement as on the TH1520. And since the
CV1800B also supports HAVE_EFFICIENT_UNALIGNED_ACCESS, the performance can be
improved further:
Random memcpy (bytes/ns):
   __memcpy 32K: 0.15 64K: 0.08 128K: 0.07 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.55 128B: 2.01 256B: 2.36 512B: 2.58
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.06 128B: 1.26 256B: 1.39 512B: 1.46
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.65 2K: 2.76 4K: 2.80 8K: 2.82 16K: 2.81 32K: 2.68 64K: 2.51
memcpy_call 1K: 1.63 2K: 1.74 4K: 1.80 8K: 1.84 16K: 1.84 32K: 1.78 64K: 1.54

Now the unaligned medium memcpy is running at 8.6x the speed of the original!

*JH7110 (I fixed cpufreq at 1.5GHZ)

Before the patch:
Random memcpy (bytes/ns):
   __memcpy 32K: 0.45 64K: 0.40 128K: 0.36 256K: 0.33 512K: 0.33 1024K: 0.31 avg 0.36
memcpy_call 32K: 0.43 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.42 16B: 0.55 32B: 0.65 64B: 0.72 128B: 2.91 256B: 3.36 512B: 3.65
memcpy_call 8B: 0.16 16B: 0.36 32B: 0.67 64B: 1.14 128B: 1.70 256B: 2.26 512B: 2.71

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.18 32B: 0.19 64B: 0.19 128B: 0.19 256B: 0.20 512B: 0.20
memcpy_call 8B: 0.16 16B: 0.20 32B: 0.36 64B: 0.62 128B: 0.94 256B: 1.28 512B: 1.52

Large memcpy (bytes/ns):
   __memcpy 1K: 3.62 2K: 3.82 4K: 3.90 8K: 3.95 16K: 3.97 32K: 1.33 64K: 1.33
memcpy_call 1K: 2.93 2K: 3.14 4K: 3.25 8K: 3.31 16K: 3.19 32K: 1.27 64K: 1.28

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.26 64K: 0.24 128K: 0.23 256K: 0.22 512K: 0.22 1024K: 0.21 avg 0.23
memcpy_call 32K: 0.42 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.17 32B: 0.18 64B: 1.94 128B: 2.56 256B: 3.04 512B: 3.36
memcpy_call 8B: 0.17 16B: 0.36 32B: 0.65 64B: 1.12 128B: 1.73 256B: 2.37 512B: 2.91

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.24 32B: 0.41 64B: 0.63 128B: 0.85 256B: 1.00 512B: 1.14
memcpy_call 8B: 0.16 16B: 0.22 32B: 0.38 64B: 0.65 128B: 0.99 256B: 1.35 512B: 1.61

Large memcpy (bytes/ns):
   __memcpy 1K: 3.43 2K: 3.59 4K: 3.67 8K: 3.72 16K: 3.73 32K: 1.28 64K: 1.28
memcpy_call 1K: 3.21 2K: 3.46 4K: 3.60 8K: 3.68 16K: 3.51 32K: 1.27 64K: 1.28

As can be seen, the unaligned medium memcpy performance is improved by
about 470%, i.e. 5.7x the speed of the original. The performance of the
other memcpy styles is kept the same as the original's.

Link: https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1]

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Co-developed-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/riscv/include/asm/string.h |   6 +-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   7 +-
 arch/riscv/lib/memcpy.S         | 110 -----------------------------
 arch/riscv/lib/string.c         | 121 ++++++++++++++++++++++++++++++++
 arch/riscv/purgatory/Makefile   |  10 +--
 6 files changed, 136 insertions(+), 120 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index a96b1fea24fe..edf1d56e4f13 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,11 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index a72879b4249a..c69dc74e0a27 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,11 +10,9 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index bd6e6c1b0497..5f2f94f6db17 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,10 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
+lib-y			+= string.o
 lib-y			+= strncmp.o
 lib-y			+= csum.o
 ifeq ($(CONFIG_MMU), y)
@@ -14,6 +14,11 @@ lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
 
+# string.o implements standard library functions like memset/memcpy etc.
+# Use -ffreestanding to ensure that the compiler does not try to "optimize"
+# them into calls to themselves.
+CFLAGS_string.o := -ffreestanding
+
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 lib-$(CONFIG_RISCV_ISA_V)	+= xor.o
 lib-$(CONFIG_RISCV_ISA_V)	+= riscv_v_helpers.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 44e009ec5fef..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,110 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-SYM_FUNC_START(__memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-SYM_FUNC_END(__memcpy)
-SYM_FUNC_ALIAS_WEAK(memcpy, __memcpy)
-SYM_FUNC_ALIAS(__pi_memcpy, __memcpy)
-SYM_FUNC_ALIAS(__pi___memcpy, __memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..5f9c83ec548d
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* Minimum size for a word copy to be convenient */
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid casts between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	const uintptr_t as_uptr;
+};
+
+static void __memcpy_aligned(unsigned long *dest, const unsigned long *src, size_t count)
+{
+	for (; count > 0; count -= BYTES_LONG * 8) {
+		register unsigned long d0, d1, d2, d3, d4, d5, d6, d7;
+		d0 = src[0];
+		d1 = src[1];
+		d2 = src[2];
+		d3 = src[3];
+		d4 = src[4];
+		d5 = src[5];
+		d6 = src[6];
+		d7 = src[7];
+		dest[0] = d0;
+		dest[1] = d1;
+		dest[2] = d2;
+		dest[3] = d3;
+		dest[4] = d4;
+		dest[5] = d5;
+		dest[6] = d6;
+		dest[7] = d7;
+		dest += 8;
+		src += 8;
+	}
+}
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (count < MIN_THRESHOLD)
+		goto copy_remainder;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		/* Copy a byte at a time until the destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
+
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word-align it
+		 * and shift the data to compensate for distance, in order to
+		 * do a word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = last >> (distance * 8) |
+					next << ((BYTES_LONG - distance) * 8);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * aligned copy.
+		 */
+		size_t aligned_count = count & ~(BYTES_LONG * 8 - 1);
+
+		__memcpy_aligned(d.as_ulong, s.as_ulong, aligned_count);
+		d.as_u8 += aligned_count;
+		s.as_u8 += aligned_count;
+		count &= BYTES_LONG * 8 - 1;
+	}
+
+copy_remainder:
+	while (count--)
+		*d.as_u8++ = *s.as_u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(__memcpy);
+
+void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
+EXPORT_SYMBOL(memcpy);
+void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
+void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
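
To make the shift-and-merge loop in __memcpy() above more concrete: with
BYTES_LONG == 8 and distance == 3, every destination word is assembled from
two aligned source loads. The standalone sketch below is illustrative only
(the helper name merge() and the sample data are not from the patch) and
assumes a little-endian host, which RISC-V is:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Combine two aligned 64-bit words into one destination word when the
 * source data starts `distance` bytes past a word boundary. Mirrors the
 * expression in __memcpy(); distance must be 1..7, since a shift by 64
 * bits would be undefined.
 */
static uint64_t merge(uint64_t last, uint64_t next, unsigned int distance)
{
	return last >> (distance * 8) | next << ((8 - distance) * 8);
}

int main(void)
{
	const unsigned char src[16] = {
		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
	};
	uint64_t last, next;

	memcpy(&last, src, 8);		/* aligned load of word 0 */
	memcpy(&next, src + 8, 8);	/* aligned load of word 1 */

	/* Expect 0x0a09080706050403: source bytes 3..10 packed into one word. */
	printf("%016llx\n", (unsigned long long)merge(last, next, 3));
	return 0;
}

Note that the kernel loop issues only one new aligned load per output word
(the previous "next" becomes the new "last"), so the misaligned path costs
roughly one load, two shifts and an OR per word stored.
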
diff --git a/arch/riscv/purgatory/Makefile b/arch/riscv/purgatory/Makefile
index 280b0eb352b8..8b940ff04895 100644
--- a/arch/riscv/purgatory/Makefile
+++ b/arch/riscv/purgatory/Makefile
@@ -1,21 +1,21 @@
 # SPDX-License-Identifier: GPL-2.0
 OBJECT_FILES_NON_STANDARD := y
 
-purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memcpy.o memset.o
-purgatory-y += strcmp.o strlen.o strncmp.o
+purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memset.o
+purgatory-y += strcmp.o strlen.o strncmp.o riscv_string.o
 
 targets += $(purgatory-y)
 PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
 
+$(obj)/riscv_string.o: $(srctree)/arch/riscv/lib/string.c FORCE
+	$(call if_changed_rule,cc_o_c)
+
 $(obj)/string.o: $(srctree)/lib/string.c FORCE
 	$(call if_changed_rule,cc_o_c)
 
 $(obj)/ctype.o: $(srctree)/lib/ctype.c FORCE
 	$(call if_changed_rule,cc_o_c)
 
-$(obj)/memcpy.o: $(srctree)/arch/riscv/lib/memcpy.S FORCE
-	$(call if_changed_rule,as_o_S)
-
 $(obj)/memset.o: $(srctree)/arch/riscv/lib/memset.S FORCE
 	$(call if_changed_rule,as_o_S)
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-01-30 12:11 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
2021-06-15  8:57   ` David Laight
2021-06-15 13:08     ` Bin Meng
2021-06-15 13:18       ` David Laight
2021-06-15 13:28         ` Bin Meng
2021-06-15 16:12           ` Emil Renner Berthing
2021-06-16  0:33             ` Bin Meng
2021-06-16  2:01               ` Matteo Croce
2021-06-16  8:24                 ` David Laight
2021-06-16 10:48                   ` Akira Tsukamoto
2021-06-16 19:06                   ` Matteo Croce
2021-06-15 13:44         ` Matteo Croce
2021-06-16 11:46   ` Guo Ren
2021-06-16 18:52     ` Matteo Croce
2021-06-17 21:30       ` David Laight
2021-06-17 21:48         ` Matteo Croce
2021-06-18  0:32           ` Matteo Croce
2021-06-18  1:05             ` Matteo Croce
2021-06-18  8:32               ` David Laight
2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
2024-01-28 12:35   ` David Laight
2024-01-30 12:11   ` Nick Kossifidis
