* [PATCH 0/3] riscv: optimized mem* functions
@ 2021-06-15  2:38 ` Matteo Croce
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Replace the assembly mem{cpy,move,set} with C equivalents.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.
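
As a minimal sketch of the alignment check this approach relies on
(illustrative only, not code from the series):

  #include <stdint.h>

  /* true when p can be read or written as a full long without an
   * unaligned access */
  static inline int is_long_aligned(const void *p)
  {
          return ((uintptr_t)p & (sizeof(long) - 1)) == 0;
  }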

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.

Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  18 ++--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  17 ----
 arch/riscv/lib/Makefile         |   4 +-
 arch/riscv/lib/memcpy.S         | 108 ---------------------
 arch/riscv/lib/memmove.S        |  64 -------------
 arch/riscv/lib/memset.S         | 113 ----------------------
 arch/riscv/lib/string.c         | 162 ++++++++++++++++++++++++++++++++
 8 files changed, 172 insertions(+), 315 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.31.1


* [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  2:38   ` Matteo Croce
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose a long at each iteration.
Finally, copy the remainder one byte at a time.
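
As an illustration of step two, a minimal standalone sketch of the
shift-based merge (assuming a 64-bit little-endian machine, a source
misalignment 'distance' of 1..7 bytes, and a source buffer that extends
at least one long past the last byte consumed; the names are
illustrative, not the patch's):

  #include <stddef.h>
  #include <stdint.h>

  static void copy_shifted(unsigned long *d, const uint8_t *src,
                           size_t longs, unsigned int distance)
  {
          /* step back to the aligned long that contains src[0] */
          const unsigned long *s = (const unsigned long *)(src - distance);
          unsigned long next = s[0];

          while (longs--) {
                  unsigned long last = next;

                  next = s[1];
                  /* e.g. distance = 2: the low 6 bytes of the store come
                   * from 'last', the top 2 bytes from 'next' */
                  *d++ = last >> (distance * 8) |
                         next << ((sizeof(long) - distance) * 8);
                  s++;
          }
  }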

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

The decreased CPU time spent in memcpy() is also observable with perf top.
This is the `perf top -Ue task-clock` output while running the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |   8 ++-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   2 +-
 arch/riscv/lib/memcpy.S         | 108 --------------------------------
 arch/riscv/lib/string.c         |  94 +++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 113 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
 
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..525f9ee25a74
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* size below which a classic byte at a time copy is done */
+#define MIN_THRESHOLD 64
+
+/* convenience types to avoid cast between different pointer types */
+union types {
+	u8 *u8;
+	unsigned long *ulong;
+	uintptr_t uptr;
+};
+
+union const_types {
+	const u8 *u8;
+	unsigned long *ulong;
+};
+
+void *memcpy(void *dest, const void *src, size_t count)
+{
+	const int bytes_long = BITS_PER_LONG / 8;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	const int mask = bytes_long - 1;
+	const int distance = (src - dest) & mask;
+#endif
+	union const_types s = { .u8 = src };
+	union types d = { .u8 = dest };
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (count <= MIN_THRESHOLD)
+		goto copy_remainder;
+
+	/* copy a byte at a time until the destination is aligned */
+	for (; count && d.uptr & mask; count--)
+		*d.u8++ = *s.u8++;
+
+	if (distance) {
+		unsigned long last, next;
+
+		/* move s backward to the previous alignment boundary */
+		s.u8 -= distance;
+
+		/* 32/64 bit wide copy from s to d.
+		 * d is aligned now but s is not, so read s alignment wise,
+		 * and do proper shift to get the right value.
+		 * Works only on Little Endian machines.
+		 */
+		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+			last = next;
+			next = s.ulong[1];
+
+			d.ulong[0] = last >> (distance * 8) |
+				     next << ((bytes_long - distance) * 8);
+
+			d.ulong++;
+			s.ulong++;
+		}
+
+		/* restore s with the original offset */
+		s.u8 += distance;
+	} else
+#endif
+	{
+		/* if the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= bytes_long; count -= bytes_long)
+			*d.ulong++ = *s.ulong++;
+	}
+
+	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
+	goto copy_remainder;
+
+copy_remainder:
+	while (count--)
+		*d.u8++ = *s.u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	return memcpy(dest, src, count);
+}
+EXPORT_SYMBOL(__memcpy);
-- 
2.31.1


* [PATCH 2/3] riscv: optimized memmove
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  2:38   ` Matteo Croce
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.
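
A userspace sketch of the overlapping case that forces the backward
copy (uses the standard C library, not the kernel code; the buffer
contents are just an example):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char buf[8] = "abcdef";

          /* dest is above src and the regions overlap, so a forward
           * copy would re-read bytes it has already overwritten */
          memmove(buf + 2, buf, 4);
          printf("%s\n", buf);    /* prints "ababcd" */
          /* a naive forward byte-at-a-time copy would give "ababab" */
          return 0;
  }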

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  6 ++--
 arch/riscv/kernel/riscv_ksyms.c |  2 --
 arch/riscv/lib/Makefile         |  1 -
 arch/riscv/lib/memmove.S        | 64 ---------------------------------
 arch/riscv/lib/string.c         | 26 ++++++++++++++
 5 files changed, 29 insertions(+), 70 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 6b5d6fc3eab4..25d9b9078569 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -17,11 +17,11 @@ extern asmlinkage void *__memset(void *, int, size_t);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
+#define __HAVE_ARCH_MEMMOVE
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 #endif
 
-#define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
 /* For those files which don't want to check by kasan. */
 #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 3f6d512a5b97..361565c4db7e 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,6 +10,4 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 2ffe85d4baee..484f5ff7b508 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index 07d1d2152ba5..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 525f9ee25a74..bc006708f075 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -92,3 +92,29 @@ void *__memcpy(void *dest, const void *src, size_t count)
 	return memcpy(dest, src, count);
 }
 EXPORT_SYMBOL(__memcpy);
+
+/*
+ * Use memcpy() when a forward copy is safe (dest below src, or no
+ * overlap), otherwise do a simple one byte at a time backward copy.
+ */
+void *memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(memmove);
+
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	return memmove(dest, src, count);
+}
+EXPORT_SYMBOL(__memmove);
-- 
2.31.1


* [PATCH 3/3] riscv: optimized memset
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  2:38   ` Matteo Croce
From: Matteo Croce @ 2021-06-15  2:38 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte at a time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed,
and finally fills the remaining data one byte at a time.
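
A standalone sketch of the byte-broadcast step described above (the
patch open-codes the shift/or chain inside memset(); the helper name
here is made up for illustration):

  #include <limits.h>

  /* repeat the low byte of c into every byte of an unsigned long */
  static unsigned long repeat_byte(unsigned long c)
  {
          c &= 0xff;
  #if ULONG_MAX > 0xffffffffUL            /* 64-bit long */
          return c << 56 | c << 48 | c << 40 | c << 32 |
                 c << 24 | c << 16 | c << 8 | c;
  #else                                   /* 32-bit long */
          return c << 24 | c << 16 | c << 8 | c;
  #endif
  }

  /* e.g. repeat_byte(0xab) == 0xabababababababab with 64-bit longs */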

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  10 +--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  13 ----
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memset.S         | 113 --------------------------------
 arch/riscv/lib/string.c         |  42 ++++++++++++
 6 files changed, 45 insertions(+), 135 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memset.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 25d9b9078569..90500635035a 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -6,14 +6,10 @@
 #ifndef _ASM_RISCV_STRING_H
 #define _ASM_RISCV_STRING_H
 
-#include <linux/types.h>
-#include <linux/linkage.h>
-
-#define __HAVE_ARCH_MEMSET
-extern asmlinkage void *memset(void *, int, size_t);
-extern asmlinkage void *__memset(void *, int, size_t);
-
 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t count);
+extern void *__memset(void *s, int c, size_t count);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index d3081e4d9600..e635ce1e5645 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -31,7 +31,6 @@ obj-y	+= syscall_table.o
 obj-y	+= sys_riscv.o
 obj-y	+= time.o
 obj-y	+= traps.o
-obj-y	+= riscv_ksyms.o
 obj-y	+= stacktrace.o
 obj-y	+= cacheinfo.o
 obj-y	+= patch.o
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
deleted file mode 100644
index 361565c4db7e..000000000000
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ /dev/null
@@ -1,13 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2017 Zihao Yu
- */
-
-#include <linux/export.h>
-#include <linux/uaccess.h>
-
-/*
- * Assembly functions that may be used (directly or indirectly) by modules
- */
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(__memset);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 484f5ff7b508..e33263cc622a 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memset.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
deleted file mode 100644
index 34c5360c6705..000000000000
--- a/arch/riscv/lib/memset.S
+++ /dev/null
@@ -1,113 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memset(void *, int, size_t) */
-ENTRY(__memset)
-WEAK(memset)
-	move t0, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
-
-	/*
-	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
-	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f  /* Skip if already aligned */
-	/* Handle initial misalignment */
-	sub a4, a3, t0
-1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2: /* Duff's device with 32 XLEN stores per iteration */
-	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
-#ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
-#endif
-
-	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
-
-	andi a4, a4, 31*SZREG  /* Calculate remainder */
-	beqz a4, 3f            /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG  /* Calculate initial offset */
-
-	/* Adjust start address with offset */
-	sub t0, t0, a4
-
-	/* Jump into loop body */
-	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
-#ifdef CONFIG_64BIT
-	srli a4, a4, 1
-#endif
-	add a5, a5, a4
-	jr a5
-3:
-	REG_S a1,        0(t0)
-	REG_S a1,    SZREG(t0)
-	REG_S a1,  2*SZREG(t0)
-	REG_S a1,  3*SZREG(t0)
-	REG_S a1,  4*SZREG(t0)
-	REG_S a1,  5*SZREG(t0)
-	REG_S a1,  6*SZREG(t0)
-	REG_S a1,  7*SZREG(t0)
-	REG_S a1,  8*SZREG(t0)
-	REG_S a1,  9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
-5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
-6:
-	ret
-END(__memset)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index bc006708f075..62869627e139 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -118,3 +118,45 @@ void *__memmove(void *dest, const void *src, size_t count)
 	return memmove(dest, src, count);
 }
 EXPORT_SYMBOL(__memmove);
+
+void *memset(void *s, int c, size_t count)
+{
+	union types dest = { .u8 = s };
+
+	if (count > MIN_THRESHOLD) {
+		const int bytes_long = BITS_PER_LONG / 8;
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+		cu =
+#if BITS_PER_LONG == 64
+			cu << 56 | cu << 48 | cu << 40 | cu << 32 |
+#endif
+			cu << 24 | cu << 16 | cu << 8 | cu;
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+		/* Fill the buffer one byte at a time until the destination
+		 * is aligned on a 32/64 bit boundary.
+		 */
+		for (; count && dest.uptr % bytes_long; count--)
+			*dest.u8++ = c;
+#endif
+
+		/* Copy using the largest size allowed */
+		for (; count >= bytes_long; count -= bytes_long)
+			*dest.ulong++ = cu;
+	}
+
+	/* copy the remainder */
+	while (count--)
+		*dest.u8++ = c;
+
+	return s;
+}
+EXPORT_SYMBOL(memset);
+
+void *__memset(void *s, int c, size_t count)
+{
+	return memset(s, c, count);
+}
+EXPORT_SYMBOL(__memset);
-- 
2.31.1


* Re: [PATCH 0/3] riscv: optimized mem* functions
  2021-06-15  2:38 ` Matteo Croce
@ 2021-06-15  2:43   ` Bin Meng
From: Bin Meng @ 2021-06-15  2:43 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

Hi Matteo,

On Tue, Jun 15, 2021 at 10:39 AM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Replace the assembly mem{cpy,move,set} with C equivalent.
>
> Try to access RAM with the largest bit width possible, but without
> doing unaligned accesses.
>
> Tested on a BeagleV Starlight with a SiFive U74 core, where the
> improvement is noticeable.
>

There is already a patch on the ML for optimizing the assembly version.
https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/

Would you please try that and compare the results?

Regards,
Bin

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38   ` Matteo Croce
@ 2021-06-15  8:57     ` David Laight
From: David Laight @ 2021-06-15  8:57 UTC (permalink / raw)
  To: 'Matteo Croce', linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 15 June 2021 03:38
> 
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.

I'm surprised that the C loop:

> +		for (; count >= bytes_long; count -= bytes_long)
> +			*d.ulong++ = *s.ulong++;

ends up being faster than the ASM 'read lots' - 'write lots' loop.
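
For reference, a rough C rendering of that 'read lots' - 'write lots'
style; the removed assembly uses 16 registers per iteration, while this
sketch uses 4, assumes long-aligned pointers, and is illustrative only:

  #include <stddef.h>

  static void copy_unrolled(unsigned long *d, const unsigned long *s,
                            size_t count)
  {
          /* read a batch of longs, then write them back as a batch */
          for (; count >= 4 * sizeof(long); count -= 4 * sizeof(long)) {
                  unsigned long r0 = s[0], r1 = s[1], r2 = s[2], r3 = s[3];

                  d[0] = r0;
                  d[1] = r1;
                  d[2] = r2;
                  d[3] = r3;
                  s += 4;
                  d += 4;
          }
  }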

Especially since there was an earlier patch to convert
copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
instead of a tight single register copy loop.

I'd also guess that the performance needs to be measured on
different classes of riscv cpu.

A simple cpu will behave differently to one that can execute
multiple instructions per clock.
Any form of 'out of order' execution also changes things.
The other big change is whether the cpu can do a memory
read and write in the same clock.

I'd guess that riscv cpus exist with some/all of those features.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  8:57     ` David Laight
@ 2021-06-15 13:08       ` Bin Meng
From: Bin Meng @ 2021-06-15 13:08 UTC (permalink / raw)
  To: David Laight
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 15 June 2021 03:38
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
>
> I'm surprised that the C loop:
>
> > +             for (; count >= bytes_long; count -= bytes_long)
> > +                     *d.ulong++ = *s.ulong++;
>
> ends up being faster than the ASM 'read lots' - 'write lots' loop.

I believe that's because the assembly version has some unaligned
access cases, which end up being trap-n-emulated in the OpenSBI
firmware, and that is a big overhead.

>
> Especially since there was an earlier patch to convert
> copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
> instead of a tight single register copy loop.
>
> I'd also guess that the performance needs to be measured on
> different classes of riscv cpu.
>
> A simple cpu will behave differently to one that can execute
> multiple instructions per clock.
> Any form of 'out of order' execution also changes things.
> The other big change is whether the cpu can do a memory
> read and write in the same clock.
>
> I'd guess that riscv cpus exist with some/all of those features.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:08       ` Bin Meng
@ 2021-06-15 13:18         ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-15 13:18 UTC (permalink / raw)
  To: 'Bin Meng'
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

From: Bin Meng
> Sent: 15 June 2021 14:09
> 
> On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> >
...
> > I'm surprised that the C loop:
> >
> > > +             for (; count >= bytes_long; count -= bytes_long)
> > > +                     *d.ulong++ = *s.ulong++;
> >
> > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> 
> I believe that's because the assembly version has some unaligned
> access cases, which end up being trap-n-emulated in the OpenSBI
> firmware, and that is a big overhead.

Ah, that would make sense since the asm user copy code
was broken for misaligned copies.
I suspect memcpy() was broken the same way.

I'm surprised NET_IP_ALIGN isn't set to 2 to try to
avoid all these misaligned copies in the network stack.
Although avoiding 8n+4 aligned data is rather harder.
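
A minimal sketch of how a driver's RX refill path typically honours
NET_IP_ALIGN (illustrative only -- the function and variable names below
are made up, this is not the BeagleV/stmmac code):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative sketch, not from this patch set: reserve NET_IP_ALIGN
 * bytes so that, after the 14-byte Ethernet header, the IP header ends
 * up 4-byte aligned and the stack does fewer misaligned copies. */
static struct sk_buff *rx_refill_skb(struct net_device *ndev,
				     const void *rx_buf, unsigned int pkt_len)
{
	struct sk_buff *skb;

	skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_IP_ALIGN);
	skb_put_data(skb, rx_buf, pkt_len);

	return skb;
}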

Misaligned copies are just best avoided - really even on x86.
The 'real fun' is when the access crosses TLB boundaries.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:18         ` David Laight
@ 2021-06-15 13:28           ` Bin Meng
  -1 siblings, 0 replies; 64+ messages in thread
From: Bin Meng @ 2021-06-15 13:28 UTC (permalink / raw)
  To: David Laight, Gary Guo
  Cc: Matteo Croce, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 9:18 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > +             for (; count >= bytes_long; count -= bytes_long)
> > > > +                     *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>

Yes, Gary Guo sent a patch a long time ago against the broken assembly
version, but that patch has still not been applied as of today.
https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/

I suggest Matteo re-test using Gary's version.

> I'm surprised NET_IP_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>
> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:18         ` David Laight
@ 2021-06-15 13:44           ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-15 13:44 UTC (permalink / raw)
  To: David Laight
  Cc: Bin Meng, linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

On Tue, Jun 15, 2021 at 3:18 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > +             for (; count >= bytes_long; count -= bytes_long)
> > > > +                     *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>
> I'm surprised NET_IP_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>

That's up to the network driver; indeed, I already have a patch for the
BeagleV one:

https://lore.kernel.org/netdev/20210615012107.577ead86@linux.microsoft.com/T/

> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.
>

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 13:28           ` Bin Meng
@ 2021-06-15 16:12             ` Emil Renner Berthing
  -1 siblings, 0 replies; 64+ messages in thread
From: Emil Renner Berthing @ 2021-06-15 16:12 UTC (permalink / raw)
  To: Bin Meng
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> ...
> Yes, Gary Guo sent one patch long time ago against the broken assembly
> version, but that patch was still not applied as of today.
> https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
>
> I suggest Matteo re-test using Gary's version.

That's a good idea, but if you read the replies to Gary's original patch
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
.. both Gary, Palmer and David would rather like a C-based version.
This is one attempt at providing that.

> > I'm surprised NET_IP_ALIGN isn't set to 2 to try to
> > avoid all these misaligned copies in the network stack.
> > Although avoiding 8n+4 aligned data is rather harder.
> >
> > Misaligned copies are just best avoided - really even on x86.
> > The 'real fun' is when the access crosses TLB boundaries.
>
> Regards,
> Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15 16:12             ` Emil Renner Berthing
@ 2021-06-16  0:33               ` Bin Meng
  -1 siblings, 0 replies; 64+ messages in thread
From: Bin Meng @ 2021-06-16  0:33 UTC (permalink / raw)
  To: Emil Renner Berthing
  Cc: David Laight, Gary Guo, Matteo Croce, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing <kernel@esmil.dk> wrote:
>
> On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > ...
> > Yes, Gary Guo sent one patch long time ago against the broken assembly
> > version, but that patch was still not applied as of today.
> > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> >
> > I suggest Matteo re-test using Gary's version.
>
> That's a good idea, but if you read the replies to Gary's original patch
> https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> .. both Gary, Palmer and David would rather like a C-based version.
> This is one attempt at providing that.

Yep, I prefer C as well :)

But if you check commit 04091d6, the assembly version was introduced
for KASAN. So if we are to change it back to C, please make sure KASAN
is not broken.

Regards,
Bin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  0:33               ` Bin Meng
@ 2021-06-16  2:01                 ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-16  2:01 UTC (permalink / raw)
  To: Bin Meng
  Cc: Emil Renner Berthing, David Laight, Gary Guo, linux-riscv,
	linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, 16 Jun 2021 08:33:21 +0800
Bin Meng <bmeng.cn@gmail.com> wrote:

> On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing
> <kernel@esmil.dk> wrote:
> >
> > On Tue, 15 Jun 2021 at 15:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> > > ...
> > > Yes, Gary Guo sent one patch long time ago against the broken
> > > assembly version, but that patch was still not applied as of
> > > today.
> > > https://patchwork.kernel.org/project/linux-riscv/patch/20210216225555.4976-1-gary@garyguo.net/
> > >
> > > I suggest Matteo re-test using Gary's version.
> >
> > That's a good idea, but if you read the replies to Gary's original
> > patch
> > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > .. both Gary, Palmer and David would rather like a C-based version.
> > This is one attempt at providing that.
> 
> Yep, I prefer C as well :)
> 
> But if you check commit 04091d6, the assembly version was introduced
> for KASAN. So if we are to change it back to C, please make sure KASAN
> is not broken.
> 
> Regards,
> Bin
> 

I added a small benchmark for memcpy() and memset() in lib/test_string.c:

memcpy_align_selftest():

#define PG_SIZE	(1 << (MAX_ORDER - 1 + PAGE_SHIFT))

	page1 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
	page2 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		for (j = 0; j < sizeof(void*); j++) {
			t0 = ktime_get();
			memcpy(dst + j, src + i, PG_SIZE - max(i, j));
			t1 = ktime_get();
			printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n",
				i, j, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
		}
	}

memset_align_selftest():
	page = alloc_pages(GFP_KERNEL, MAX_ORDER-1);

	for (i = 0; i < sizeof(void*); i++) {
		t0 = ktime_get();
		memset(dst + i, 0, PG_SIZE - i);
		t1 = ktime_get();
		printk("Strings selftest: memset(dst+%d): %llu Mb/s\n", i,
			PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
	}

And I ran it against the three implementations: the current one, Gary's
assembly version and mine in C.

Current:
[   38.980687] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   39.021612] Strings selftest: memcpy(src+0, dst+1): 113 Mb/s
[   39.062191] Strings selftest: memcpy(src+0, dst+2): 114 Mb/s
[   39.102669] Strings selftest: memcpy(src+0, dst+3): 114 Mb/s
[   39.127423] Strings selftest: memcpy(src+0, dst+4): 209 Mb/s
[   39.167836] Strings selftest: memcpy(src+0, dst+5): 115 Mb/s
[   39.208305] Strings selftest: memcpy(src+0, dst+6): 114 Mb/s
[   39.248712] Strings selftest: memcpy(src+0, dst+7): 115 Mb/s
[   39.288144] Strings selftest: memcpy(src+1, dst+0): 118 Mb/s
[   39.309190] Strings selftest: memcpy(src+1, dst+1): 260 Mb/s
[   39.349721] Strings selftest: memcpy(src+1, dst+2): 114 Mb/s
[...]
[   41.289423] Strings selftest: memcpy(src+7, dst+5): 114 Mb/s
[   41.328801] Strings selftest: memcpy(src+7, dst+6): 118 Mb/s
[   41.349907] Strings selftest: memcpy(src+7, dst+7): 259 Mb/s

[   41.377735] Strings selftest: memset(dst+0): 241 Mb/s
[   41.397882] Strings selftest: memset(dst+1): 265 Mb/s
[   41.417666] Strings selftest: memset(dst+2): 272 Mb/s
[   41.437169] Strings selftest: memset(dst+3): 277 Mb/s
[   41.456656] Strings selftest: memset(dst+4): 277 Mb/s
[   41.476125] Strings selftest: memset(dst+5): 278 Mb/s
[   41.495555] Strings selftest: memset(dst+6): 278 Mb/s
[   41.515002] Strings selftest: memset(dst+7): 278 Mb/s

Gary's
[   27.438112] Strings selftest: memcpy(src+0, dst+0): 232 Mb/s
[   27.461586] Strings selftest: memcpy(src+0, dst+1): 224 Mb/s
[   27.484691] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   27.507693] Strings selftest: memcpy(src+0, dst+3): 230 Mb/s
[   27.530758] Strings selftest: memcpy(src+0, dst+4): 229 Mb/s
[   27.553840] Strings selftest: memcpy(src+0, dst+5): 229 Mb/s
[   27.576793] Strings selftest: memcpy(src+0, dst+6): 231 Mb/s
[   27.599862] Strings selftest: memcpy(src+0, dst+7): 230 Mb/s
[   27.622888] Strings selftest: memcpy(src+1, dst+0): 230 Mb/s
[   27.643964] Strings selftest: memcpy(src+1, dst+1): 259 Mb/s
[   27.666926] Strings selftest: memcpy(src+1, dst+2): 231 Mb/s
[...]
[   28.831726] Strings selftest: memcpy(src+7, dst+5): 230 Mb/s
[   28.854790] Strings selftest: memcpy(src+7, dst+6): 229 Mb/s
[   28.875844] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[   28.903666] Strings selftest: memset(dst+0): 240 Mb/s
[   28.923533] Strings selftest: memset(dst+1): 269 Mb/s
[   28.943100] Strings selftest: memset(dst+2): 275 Mb/s
[   28.962554] Strings selftest: memset(dst+3): 277 Mb/s
[   28.982009] Strings selftest: memset(dst+4): 277 Mb/s
[   29.001412] Strings selftest: memset(dst+5): 278 Mb/s
[   29.020894] Strings selftest: memset(dst+6): 277 Mb/s
[   29.040383] Strings selftest: memset(dst+7): 276 Mb/s

Mine:
[   33.916144] Strings selftest: memcpy(src+0, dst+0): 222 Mb/s
[   33.939520] Strings selftest: memcpy(src+0, dst+1): 226 Mb/s
[   33.962666] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[   33.985749] Strings selftest: memcpy(src+0, dst+3): 229 Mb/s
[   34.008748] Strings selftest: memcpy(src+0, dst+4): 231 Mb/s
[   34.031970] Strings selftest: memcpy(src+0, dst+5): 228 Mb/s
[   34.055065] Strings selftest: memcpy(src+0, dst+6): 229 Mb/s
[   34.078068] Strings selftest: memcpy(src+0, dst+7): 231 Mb/s
[   34.101177] Strings selftest: memcpy(src+1, dst+0): 229 Mb/s
[   34.122995] Strings selftest: memcpy(src+1, dst+1): 247 Mb/s
[   34.146072] Strings selftest: memcpy(src+1, dst+2): 229 Mb/s
[...]
[   35.315594] Strings selftest: memcpy(src+7, dst+5): 229 Mb/s
[   35.338617] Strings selftest: memcpy(src+7, dst+6): 230 Mb/s
[   35.360464] Strings selftest: memcpy(src+7, dst+7): 247 Mb/s

[   35.388929] Strings selftest: memset(dst+0): 232 Mb/s
[   35.409351] Strings selftest: memset(dst+1): 260 Mb/s
[   35.429434] Strings selftest: memset(dst+2): 266 Mb/s
[   35.449460] Strings selftest: memset(dst+3): 267 Mb/s
[   35.469479] Strings selftest: memset(dst+4): 267 Mb/s
[   35.489481] Strings selftest: memset(dst+5): 268 Mb/s
[   35.509443] Strings selftest: memset(dst+6): 269 Mb/s
[   35.529449] Strings selftest: memset(dst+7): 268 Mb/s

Leaving out the first memcpy/memset of every test, which is always slower (maybe
because of a cache miss?), the current implementation copies at 260 Mb/s when
the low order bits match, and 114 Mb/s otherwise.
Memset is stable at 278 Mb/s.

Gary's implementation is much faster: it still copies at 260 Mb/s when equally
placed, and 230 Mb/s otherwise. Memset is the same as the current one.

Mine has the same speed as Gary's when the low order bits mismatch, but it's
slower when equally aligned, where it stops at 247 Mb/s.
Memset is slightly slower at 269 Mb/s.


I'm not familiar with RISC-V assembly, but looking at Gary's assembly I
think that he manually unrolled the loop, copying 16 uint64_t at a time
using 16 registers.
I managed to do the same with a small change in the C code and a pragma directive:

This for memcpy():

	if (distance) {
		unsigned long last, next;
		int i;

		s.u8 -= distance;

		for (; count >= bytes_long * 8 + mask; count -= bytes_long * 8) {
			next = s.ulong[0];
			for (i = 0; i < 8; i++) {
				last = next;
				next = s.ulong[i + 1];

				d.ulong[i] = last >> (distance * 8) |
					next << ((bytes_long - distance) * 8);
			}

			d.ulong += 8;
			s.ulong += 8;
		}

		s.u8 += distance;
	} else {
		/* 8 byte wide copy */
		int i;
		for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
			for (i = 0; i < 8; i++)
				d.ulong[i] = s.ulong[i];
			d.ulong += 8;
			s.ulong += 8;
		}
	}

And this for memset:

		for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
			for (i = 0; i < 8; i++)
				dest.ulong[i] = cu;

			dest.ulong += 8;
		}

And the generated machine code is very, very similar to Gary's one!
And these are the result:

[   35.898366] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[   35.920942] Strings selftest: memcpy(src+0, dst+1): 236 Mb/s
[   35.943171] Strings selftest: memcpy(src+0, dst+2): 241 Mb/s
[   35.965291] Strings selftest: memcpy(src+0, dst+3): 242 Mb/s
[   35.987374] Strings selftest: memcpy(src+0, dst+4): 244 Mb/s
[   36.009554] Strings selftest: memcpy(src+0, dst+5): 242 Mb/s
[   36.031721] Strings selftest: memcpy(src+0, dst+6): 242 Mb/s
[   36.053881] Strings selftest: memcpy(src+0, dst+7): 242 Mb/s
[   36.075949] Strings selftest: memcpy(src+1, dst+0): 243 Mb/s
[   36.097084] Strings selftest: memcpy(src+1, dst+1): 258 Mb/s
[   36.119269] Strings selftest: memcpy(src+1, dst+2): 242 Mb/s
[...]
[   37.242433] Strings selftest: memcpy(src+7, dst+5): 242 Mb/s
[   37.264571] Strings selftest: memcpy(src+7, dst+6): 242 Mb/s
[   37.285609] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[   37.313633] Strings selftest: memset(dst+0): 237 Mb/s
[   37.333682] Strings selftest: memset(dst+1): 266 Mb/s
[   37.353375] Strings selftest: memset(dst+2): 273 Mb/s
[   37.373000] Strings selftest: memset(dst+3): 274 Mb/s
[   37.392608] Strings selftest: memset(dst+4): 274 Mb/s
[   37.412220] Strings selftest: memset(dst+5): 274 Mb/s
[   37.431848] Strings selftest: memset(dst+6): 274 Mb/s
[   37.451467] Strings selftest: memset(dst+7): 274 Mb/s

This version is even faster than the assembly one, but it won't kick in for
copies/sets smaller than 64 bytes, or maybe even 128. With small buffers
it will copy the bytes one at a time, so I don't know if it's worth it.

What is preferred in your opinion: an implementation which is always fast
at all sizes, or one which is a bit faster for large copies but slow with small ones?
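
For reference, one middle-ground shape this could take is dispatching on
size and keeping the byte loop for small counts. A minimal userspace
sketch (the 64-byte cut-off is a guess and the word path is simplified,
so this is not the code from this patch):

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: byte loop for small copies, word loop for larger ones.
 * The 64-byte threshold is arbitrary, and this fast path only triggers
 * when src and dst are mutually aligned, unlike the real patch.
 */
static void *memcpy_sketch(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (count >= 64 &&
	    (((uintptr_t)d ^ (uintptr_t)s) & (sizeof(long) - 1)) == 0) {
		/* align the destination, then copy one long at a time */
		while ((uintptr_t)d & (sizeof(long) - 1)) {
			*d++ = *s++;
			count--;
		}
		for (; count >= sizeof(long); count -= sizeof(long)) {
			*(long *)d = *(const long *)s;
			d += sizeof(long);
			s += sizeof(long);
		}
	}

	/* small copies and the tail go one byte at a time */
	while (count--)
		*d++ = *s++;

	return dest;
}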

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  2:01                 ` Matteo Croce
@ 2021-06-16  8:24                   ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-16  8:24 UTC (permalink / raw)
  To: 'Matteo Croce', Bin Meng
  Cc: Emil Renner Berthing, Gary Guo, linux-riscv, linux-kernel,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Akira Tsukamoto, Drew Fustini

From: Matteo Croce
> Sent: 16 June 2021 03:02
...
> > > That's a good idea, but if you read the replies to Gary's original
> > > patch
> > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > .. both Gary, Palmer and David would rather like a C-based version.
> > > This is one attempt at providing that.
> >
> > Yep, I prefer C as well :)
> >
> > But if you check commit 04091d6, the assembly version was introduced
> > for KASAN. So if we are to change it back to C, please make sure KASAN
> > is not broken.
> >
...
> Leaving out the first memcpy/set of every test which is always slower, (maybe
> because of a cache miss?), the current implementation copies 260 Mb/s when
> the low order bits match, and 114 otherwise.
> Memset is stable at 278 Mb/s.
> 
> Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> and 230 Mb/s otherwise. Memset is the same as the current one.

Any idea what the attainable performance is for the cpu you are using?
Since both memset and memcpy are running at much the same speed
I suspect it is all limited by the writes.

272MB/s is only 34M (8-byte) writes/sec.
This seems horribly slow for a modern cpu.
So is this actually really limited by the cache writes to physical memory?

You might want to do some tests (userspace is fine) where you
check much smaller lengths that definitely sit within the data cache.
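
Something along these lines would do (sketch only; the sizes, iteration
count and clock_gettime() timing are arbitrary choices, and you would
link the candidate memcpy routine in place of the libc one):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Sketch: time cache-resident copies at small sizes, so the numbers show
 * per-call overhead rather than DRAM bandwidth. */
int main(void)
{
	static unsigned char src[4096], dst[4096];
	const size_t sizes[] = { 8, 16, 32, 64, 128, 256, 1024 };
	const long iters = 1000000;

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long n = 0; n < iters; n++) {
			memcpy(dst, src, sizes[i]);
			/* keep the copy from being optimized away */
			__asm__ volatile("" : : "r"(dst) : "memory");
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
			    (t1.tv_nsec - t0.tv_nsec);
		printf("%5zu bytes: %6.1f ns/call\n", sizes[i], ns / iters);
	}
	return 0;
}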

It is also worth checking how much overhead there is for
short copies - they are almost certainly more common than
you might expect.
This is one problem with excessive loop unrolling - the 'special
cases' for the ends of the buffer start having a big effect
on small copies.

For cpus that support misaligned memory accesses, one 'trick'
for transfers longer than a 'word' is to do a (probably) misaligned
transfer of the last word of the buffer first followed by the
transfer of the rest of the buffer (overlapping a few bytes at the end).
This saves on conditionals and temporary values.
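
Roughly, in C, the trick looks like this (a sketch that assumes count is
at least one word, the buffers don't overlap, and the hardware handles
the misaligned accesses -- which is exactly what is in question here; the
fixed-size memcpy() calls are just the portable way to spell a possibly
misaligned word load/store):

#include <stddef.h>
#include <string.h>

/* Sketch: save the (possibly misaligned) last word first, run a plain
 * word loop with no end-of-buffer special cases, then store the saved
 * word, overlapping whatever the loop already wrote. */
static void *copy_overlap_tail(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	unsigned char *end = d + count - sizeof(long);
	unsigned long w, tail;

	memcpy(&tail, s + count - sizeof(long), sizeof(long));

	while (d < end) {
		memcpy(&w, s, sizeof(long));
		memcpy(d, &w, sizeof(long));
		d += sizeof(long);
		s += sizeof(long);
	}

	memcpy(end, &tail, sizeof(long));
	return dest;
}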

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  8:24                   ` David Laight
@ 2021-06-16 10:48                     ` Akira Tsukamoto
  -1 siblings, 0 replies; 64+ messages in thread
From: Akira Tsukamoto @ 2021-06-16 10:48 UTC (permalink / raw)
  To: David Laight
  Cc: Matteo Croce, Bin Meng, Emil Renner Berthing, Gary Guo,
	linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Drew Fustini

On Wed, Jun 16, 2021 at 5:24 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>
> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>
> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.

I am fine with Matteo's memcpy.

The two culprits seen in the `perf top -Ue task-clock` output during the
tcp and udp network tests are

> Overhead  Shared O  Symbol
>  42.22%  [kernel]  [k] memcpy
>  35.00%  [kernel]  [k] __asm_copy_to_user

so we really need to optimize both memcpy and __asm_copy_to_user.

The main reason for the speed-up in memcpy is that

> Gary's assembly version of memcpy improves things by not doing unaligned
> accesses across a 64 bit boundary; instead it shifts the data after reading
> with an offset from an aligned access, because every misaligned access is
> trapped and switched to OpenSBI in M-mode. The main speed-up comes from
> avoiding the S-mode (kernel) and M-mode (OpenSBI) switching.

which are in the code:

Gary's:
+       /* Calculate shifts */
+       slli    t3, a3, 3
+       sub    t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+       /* Load the initial value and align a1 */
+       andi    a1, a1, ~(SZREG-1)
+       REG_L    a5, 0(a1)
+
+       addi    t0, t0, -(SZREG-1)
+       /* At least one iteration will be executed here, no check */
+1:
+       srl    a4, a5, t3
+       REG_L    a5, SZREG(a1)
+       addi    a1, a1, SZREG
+       sll    a2, a5, t4
+       or    a2, a2, a4
+       REG_S    a2, 0(a0)
+       addi    a0, a0, SZREG
+       bltu    a0, t0, 1b

and Matteo ported to C:

+#pragma GCC unroll 8
+        for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+            last = next;
+            next = s.ulong[1];
+
+            d.ulong[0] = last >> (distance * 8) |
+                     next << ((bytes_long - distance) * 8);
+
+            d.ulong++;
+            s.ulong++;
+        }

I believe this is reasonable and good enough to go upstream.

Akira


>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-15  2:38   ` Matteo Croce
@ 2021-06-16 11:46     ` Guo Ren
  -1 siblings, 0 replies; 64+ messages in thread
From: Guo Ren @ 2021-06-16 11:46 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

Hi Matteo,

Have you tried the Glibc generic implementation code?
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

If the Glibc code has the same performance on your hardware, then you
could propose a generic implementation first.

The current Linux generic implementation in lib/string.c is very simple:
#ifndef __HAVE_ARCH_MEMCPY
/**
 * memcpy - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * You should not use this function to access IO space, use memcpy_toio()
 * or memcpy_fromio() instead.
 */
void *memcpy(void *dest, const void *src, size_t count)
{
        char *tmp = dest;
        const char *s = src;

        while (count--)
                *tmp++ = *s++;
        return dest;
}
EXPORT_SYMBOL(memcpy);
#endif

On Tue, Jun 15, 2021 at 10:42 AM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
>
> The procedure is made of three steps:
> First copy data one byte at a time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at a time, shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at a time.
>
> On a BeagleV, the TCP RX throughput increased by 45%:
>
> before:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> [  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
> [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
> [  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
> [  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
> [  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
> [  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
> [  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
> [  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
> [  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver
>
> after:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
> [  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
> [  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
> [  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
> [  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
> [  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
> [  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
> [  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
> [  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
> [  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver
>
> And the decreased CPU time of the memcpy() is observable with perf top.
> This is the `perf top -Ue task-clock` output when doing the test:
>
> before:
>
> Overhead  Shared O  Symbol
>   42.22%  [kernel]  [k] memcpy
>   35.00%  [kernel]  [k] __asm_copy_to_user
>    3.50%  [kernel]  [k] sifive_l2_flush64_range
>    2.30%  [kernel]  [k] stmmac_napi_poll_rx
>    1.11%  [kernel]  [k] memset
>
> after:
>
> Overhead  Shared O  Symbol
>   45.69%  [kernel]  [k] __asm_copy_to_user
>   29.06%  [kernel]  [k] memcpy
>    4.09%  [kernel]  [k] sifive_l2_flush64_range
>    2.77%  [kernel]  [k] stmmac_napi_poll_rx
>    1.24%  [kernel]  [k] memset
>
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  arch/riscv/include/asm/string.h |   8 ++-
>  arch/riscv/kernel/riscv_ksyms.c |   2 -
>  arch/riscv/lib/Makefile         |   2 +-
>  arch/riscv/lib/memcpy.S         | 108 --------------------------------
>  arch/riscv/lib/string.c         |  94 +++++++++++++++++++++++++++
>  5 files changed, 101 insertions(+), 113 deletions(-)
>  delete mode 100644 arch/riscv/lib/memcpy.S
>  create mode 100644 arch/riscv/lib/string.c
>
> diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
> index 909049366555..6b5d6fc3eab4 100644
> --- a/arch/riscv/include/asm/string.h
> +++ b/arch/riscv/include/asm/string.h
> @@ -12,9 +12,13 @@
>  #define __HAVE_ARCH_MEMSET
>  extern asmlinkage void *memset(void *, int, size_t);
>  extern asmlinkage void *__memset(void *, int, size_t);
> +
> +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
>  #define __HAVE_ARCH_MEMCPY
> -extern asmlinkage void *memcpy(void *, const void *, size_t);
> -extern asmlinkage void *__memcpy(void *, const void *, size_t);
> +extern void *memcpy(void *dest, const void *src, size_t count);
> +extern void *__memcpy(void *dest, const void *src, size_t count);
> +#endif
> +
>  #define __HAVE_ARCH_MEMMOVE
>  extern asmlinkage void *memmove(void *, const void *, size_t);
>  extern asmlinkage void *__memmove(void *, const void *, size_t);
> diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
> index 5ab1c7e1a6ed..3f6d512a5b97 100644
> --- a/arch/riscv/kernel/riscv_ksyms.c
> +++ b/arch/riscv/kernel/riscv_ksyms.c
> @@ -10,8 +10,6 @@
>   * Assembly functions that may be used (directly or indirectly) by modules
>   */
>  EXPORT_SYMBOL(memset);
> -EXPORT_SYMBOL(memcpy);
>  EXPORT_SYMBOL(memmove);
>  EXPORT_SYMBOL(__memset);
> -EXPORT_SYMBOL(__memcpy);
>  EXPORT_SYMBOL(__memmove);
> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> index 25d5c9664e57..2ffe85d4baee 100644
> --- a/arch/riscv/lib/Makefile
> +++ b/arch/riscv/lib/Makefile
> @@ -1,9 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  lib-y                  += delay.o
> -lib-y                  += memcpy.o
>  lib-y                  += memset.o
>  lib-y                  += memmove.o
>  lib-$(CONFIG_MMU)      += uaccess.o
>  lib-$(CONFIG_64BIT)    += tishift.o
> +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
>
>  obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
> deleted file mode 100644
> index 51ab716253fa..000000000000
> --- a/arch/riscv/lib/memcpy.S
> +++ /dev/null
> @@ -1,108 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2013 Regents of the University of California
> - */
> -
> -#include <linux/linkage.h>
> -#include <asm/asm.h>
> -
> -/* void *memcpy(void *, const void *, size_t) */
> -ENTRY(__memcpy)
> -WEAK(memcpy)
> -       move t6, a0  /* Preserve return value */
> -
> -       /* Defer to byte-oriented copy for small sizes */
> -       sltiu a3, a2, 128
> -       bnez a3, 4f
> -       /* Use word-oriented copy only if low-order bits match */
> -       andi a3, t6, SZREG-1
> -       andi a4, a1, SZREG-1
> -       bne a3, a4, 4f
> -
> -       beqz a3, 2f  /* Skip if already aligned */
> -       /*
> -        * Round to nearest double word-aligned address
> -        * greater than or equal to start address
> -        */
> -       andi a3, a1, ~(SZREG-1)
> -       addi a3, a3, SZREG
> -       /* Handle initial misalignment */
> -       sub a4, a3, a1
> -1:
> -       lb a5, 0(a1)
> -       addi a1, a1, 1
> -       sb a5, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 1b
> -       sub a2, a2, a4  /* Update count */
> -
> -2:
> -       andi a4, a2, ~((16*SZREG)-1)
> -       beqz a4, 4f
> -       add a3, a1, a4
> -3:
> -       REG_L a4,       0(a1)
> -       REG_L a5,   SZREG(a1)
> -       REG_L a6, 2*SZREG(a1)
> -       REG_L a7, 3*SZREG(a1)
> -       REG_L t0, 4*SZREG(a1)
> -       REG_L t1, 5*SZREG(a1)
> -       REG_L t2, 6*SZREG(a1)
> -       REG_L t3, 7*SZREG(a1)
> -       REG_L t4, 8*SZREG(a1)
> -       REG_L t5, 9*SZREG(a1)
> -       REG_S a4,       0(t6)
> -       REG_S a5,   SZREG(t6)
> -       REG_S a6, 2*SZREG(t6)
> -       REG_S a7, 3*SZREG(t6)
> -       REG_S t0, 4*SZREG(t6)
> -       REG_S t1, 5*SZREG(t6)
> -       REG_S t2, 6*SZREG(t6)
> -       REG_S t3, 7*SZREG(t6)
> -       REG_S t4, 8*SZREG(t6)
> -       REG_S t5, 9*SZREG(t6)
> -       REG_L a4, 10*SZREG(a1)
> -       REG_L a5, 11*SZREG(a1)
> -       REG_L a6, 12*SZREG(a1)
> -       REG_L a7, 13*SZREG(a1)
> -       REG_L t0, 14*SZREG(a1)
> -       REG_L t1, 15*SZREG(a1)
> -       addi a1, a1, 16*SZREG
> -       REG_S a4, 10*SZREG(t6)
> -       REG_S a5, 11*SZREG(t6)
> -       REG_S a6, 12*SZREG(t6)
> -       REG_S a7, 13*SZREG(t6)
> -       REG_S t0, 14*SZREG(t6)
> -       REG_S t1, 15*SZREG(t6)
> -       addi t6, t6, 16*SZREG
> -       bltu a1, a3, 3b
> -       andi a2, a2, (16*SZREG)-1  /* Update count */
> -
> -4:
> -       /* Handle trailing misalignment */
> -       beqz a2, 6f
> -       add a3, a1, a2
> -
> -       /* Use word-oriented copy if co-aligned to word boundary */
> -       or a5, a1, t6
> -       or a5, a5, a3
> -       andi a5, a5, 3
> -       bnez a5, 5f
> -7:
> -       lw a4, 0(a1)
> -       addi a1, a1, 4
> -       sw a4, 0(t6)
> -       addi t6, t6, 4
> -       bltu a1, a3, 7b
> -
> -       ret
> -
> -5:
> -       lb a4, 0(a1)
> -       addi a1, a1, 1
> -       sb a4, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 5b
> -6:
> -       ret
> -END(__memcpy)
> diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> new file mode 100644
> index 000000000000..525f9ee25a74
> --- /dev/null
> +++ b/arch/riscv/lib/string.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * String functions optimized for hardware which doesn't
> + * handle unaligned memory accesses efficiently.
> + *
> + * Copyright (C) 2021 Matteo Croce
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +
> +/* size below a classic byte at time copy is done */
> +#define MIN_THRESHOLD 64
> +
> +/* convenience types to avoid cast between different pointer types */
> +union types {
> +       u8 *u8;
> +       unsigned long *ulong;
> +       uintptr_t uptr;
> +};
> +
> +union const_types {
> +       const u8 *u8;
> +       unsigned long *ulong;
> +};
> +
> +void *memcpy(void *dest, const void *src, size_t count)
> +{
> +       const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +       const int mask = bytes_long - 1;
> +       const int distance = (src - dest) & mask;
> +#endif
> +       union const_types s = { .u8 = src };
> +       union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +       if (count <= MIN_THRESHOLD)
> +               goto copy_remainder;
> +
> +       /* copy a byte at time until destination is aligned */
> +       for (; count && d.uptr & mask; count--)
> +               *d.u8++ = *s.u8++;
> +
> +       if (distance) {
> +               unsigned long last, next;
> +
> +               /* move s backward to the previous alignment boundary */
> +               s.u8 -= distance;
> +
> +               /* 32/64 bit wide copy from s to d.
> +                * d is aligned now but s is not, so read s alignment wise,
> +                * and do proper shift to get the right value.
> +                * Works only on Little Endian machines.
> +                */
> +               for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> +                       last = next;
> +                       next = s.ulong[1];
> +
> +                       d.ulong[0] = last >> (distance * 8) |
> +                                    next << ((bytes_long - distance) * 8);
> +
> +                       d.ulong++;
> +                       s.ulong++;
> +               }
> +
> +               /* restore s with the original offset */
> +               s.u8 += distance;
> +       } else
> +#endif
> +       {
> +               /* if the source and dest lower bits are the same, do a simple
> +                * 32/64 bit wide copy.
> +                */
> +               for (; count >= bytes_long; count -= bytes_long)
> +                       *d.ulong++ = *s.ulong++;
> +       }
> +
> +       /* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
> +       goto copy_remainder;
> +
> +copy_remainder:
> +       while (count--)
> +               *d.u8++ = *s.u8++;
> +
> +       return dest;
> +}
> +EXPORT_SYMBOL(memcpy);
> +
> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +       return memcpy(dest, src, count);
> +}
> +EXPORT_SYMBOL(__memcpy);
> --
> 2.31.1
>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16 11:46     ` Guo Ren
@ 2021-06-16 18:52       ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-16 18:52 UTC (permalink / raw)
  To: Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
>
> Hi Matteo,
>
> Have you tried Glibc generic implementation code?
> ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
>
> If Glibc codes have the same performance in your hardware, then you
> could give a generic implementation first.
>

Hi,

I had a look, it seems that it's a C unrolled version with the
'register' keyword.
The same one was already merged in nios2:
https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
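
For reference, the shape of that unrolled aligned word copy is roughly
the following (just a sketch to show the idea, not the actual glibc or
nios2 source):

#include <stddef.h>

/* Both pointers already aligned to sizeof(long); copies nwords words. */
static void wordcopy_fwd_aligned_sketch(unsigned long *dstp,
                                        const unsigned long *srcp,
                                        size_t nwords)
{
        while (nwords >= 8) {
                register unsigned long a0 = srcp[0], a1 = srcp[1];
                register unsigned long a2 = srcp[2], a3 = srcp[3];
                register unsigned long a4 = srcp[4], a5 = srcp[5];
                register unsigned long a6 = srcp[6], a7 = srcp[7];

                dstp[0] = a0; dstp[1] = a1; dstp[2] = a2; dstp[3] = a3;
                dstp[4] = a4; dstp[5] = a5; dstp[6] = a6; dstp[7] = a7;

                srcp += 8;
                dstp += 8;
                nwords -= 8;
        }

        /* leftover words after the unrolled part */
        while (nwords--)
                *dstp++ = *srcp++;
}

The 'register' hints are largely ignored by modern compilers; the
unrolling itself is what amortizes the loop control.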

I copied _wordcopy_fwd_aligned() from Glibc, and I get a very similar
result to the other versions:

[  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16  8:24                   ` David Laight
@ 2021-06-16 19:06                     ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-16 19:06 UTC (permalink / raw)
  To: David Laight
  Cc: Bin Meng, Emil Renner Berthing, Gary Guo, linux-riscv,
	linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Akira Tsukamoto, Drew Fustini

On Wed, Jun 16, 2021 at 10:24 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when euqlly placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>

I get similar results in userspace; this tool writes to RAM with a
variable access width:

root@beaglev:~/src# ./unalign_check 1 0 1
size:           1 Mb
write size:      8 bit
unalignment:    0 byte
elapsed time:   0.01 sec
throughput:     124.36 Mb/s

# ./unalign_check 1 0 8
size:           1 Mb
write size:      64 bit
unalignment:    0 byte
elapsed time:   0.00 sec
throughput:     252.12 Mb/s
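
(Roughly, the tool does the equivalent of the following; this is a
sketch of what it measures, not the actual unalign_check source:)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
        size_t size = 1UL << 20;        /* 1 Mb buffer */
        uint64_t *buf = malloc(size);
        struct timespec t0, t1;
        double secs;
        size_t i;

        if (!buf)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < size / sizeof(*buf); i++)
                buf[i] = i;             /* 64 bit wide writes */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("throughput:     %.2f Mb/s\n", size / secs / 1e6);

        /* read something back so the compiler cannot drop the stores */
        printf("check:          %llu\n",
               (unsigned long long)buf[size / sizeof(*buf) - 1]);

        free(buf);
        return 0;
}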

> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>

I too believe that they are much more common than long ones.
Indeed, I'd like to reduce the MIN_THRESHOLD value from 64 to 32 or even 16,
or make it dependent on the word size, e.g. sizeof(long) * 2.

Suggestions?

> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-16 18:52       ` Matteo Croce
@ 2021-06-17 21:30         ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-17 21:30 UTC (permalink / raw)
  To: 'Matteo Croce', Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 16 June 2021 19:52
> To: Guo Ren <guoren@kernel.org>
> 
> On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> >
> > Hi Matteo,
> >
> > Have you tried Glibc generic implementation code?
> > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> >
> > If Glibc codes have the same performance in your hardware, then you
> > could give a generic implementation first.

Isn't that a byte copy loop - the performance of that ought to be terrible.
...

> I had a look, it seems that it's a C unrolled version with the
> 'register' keyword.
> The same one was already merged in nios2:
> https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I know a lot about the nios2 instruction timings.
(I've looked at code execution in the FPGA's Intel 'logic analyser'.)
It is a very simple 4-clock pipeline cpu with a 2-clock delay
before a value read from 'tightly coupled memory' (aka cache)
can be used in another instruction.
There is also a subtle pipeline stall if a read follows a write
to the same memory block, because the write is executed one
clock later - and would collide with the read.
Since it only ever executes one instruction per clock, loop
unrolling does help - you never get the loop control 'for free'.
OTOH you don't need to use that many registers.
But an unrolled loop should approach 2 bytes/clock (32-bit cpu).
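
Back of the envelope, assuming a single-issue core retiring one
instruction per clock and the 2-clock load-use delay hidden by the
unrolling:

  per 32-bit word copied:  1 load + 1 store = 2 clocks
  loop control:            a few instructions amortized over 8 words

so roughly 2.3-2.5 clocks per 4 bytes, i.e. ~1.7 bytes/clock, tending
towards 2 bytes/clock as the unroll factor grows.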

> I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> result of the other versions:
> 
> [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

What clock speed is that running at?
It seems very slow for a 64-bit cpu (that isn't an FPGA soft-cpu).

While the small riscv cpu might be similar to the nios2 (and mips
for that matter), there are also bigger/faster cpus.
I'm sure these can execute multiple instructions/clock
and possibly even read and write at the same time.
Unless they also support significant instruction re-ordering,
the trivial copy loops are going to be slow on such cpus.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-17 21:30         ` David Laight
@ 2021-06-17 21:48           ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-17 21:48 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 19:52
> > To: Guo Ren <guoren@kernel.org>
> >
> > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > >
> > > Hi Matteo,
> > >
> > > Have you tried Glibc generic implementation code?
> > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > >
> > > If Glibc codes have the same performance in your hardware, then you
> > > could give a generic implementation first.
>
> Isn't that a byte copy loop - the performance of that ought to be terrible.
> ...
>
> > I had a look, it seems that it's a C unrolled version with the
> > 'register' keyword.
> > The same one was already merged in nios2:
> > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
>
> I know a lot about the nios2 instruction timings.
> (I've looked at code execution in the fpga's intel 'logic analiser.)
> It is a very simple 4-clock pipeline cpu with a 2-clock delay
> before a value read from 'tightly coupled memory' (aka cache)
> can be used in another instruction.
> There is also a subtle pipeline stall if a read follows a write
> to the same memory block because the write is executed one
> clock later - and would collide with the read.
> Since it only ever executes one instruction per clock loop
> unrolling does help - since you never get the loop control 'for free'.
> OTOH you don't need to use that many registers.
> But an unrolled loop should approach 2 bytes/clock (32bit cpu).
>
> > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > result of the other versions:
> >
> > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
>
> What clock speed is that running at?
> It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
>
> While the small riscv cpu might be similar to the nios2 (and mips
> for that matter), there are also bigger/faster cpu.
> I'm sure these can execute multiple instructions/clock
> and possible even read and write at the same time.
> Unless they also support significant instruction re-ordering
> the trivial copy loops are going to be slow on such cpu.
>

It's running at 1 GHz.

I get 257 Mb/s with a memcpy, a bit more with a memset,
but I get 1200 Mb/s with a loop which just reads memory 64 bits at a time.

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-17 21:48           ` Matteo Croce
@ 2021-06-18  0:32             ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-18  0:32 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
<mcroce@linux.microsoft.com> wrote:
>
> On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Matteo Croce
> > > Sent: 16 June 2021 19:52
> > > To: Guo Ren <guoren@kernel.org>
> > >
> > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > > >
> > > > Hi Matteo,
> > > >
> > > > Have you tried Glibc generic implementation code?
> > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > > >
> > > > If Glibc codes have the same performance in your hardware, then you
> > > > could give a generic implementation first.
> >
> > Isn't that a byte copy loop - the performance of that ought to be terrible.
> > ...
> >
> > > I had a look, it seems that it's a C unrolled version with the
> > > 'register' keyword.
> > > The same one was already merged in nios2:
> > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
> >
> > I know a lot about the nios2 instruction timings.
> > (I've looked at code execution in the fpga's intel 'logic analiser.)
> > It is a very simple 4-clock pipeline cpu with a 2-clock delay
> > before a value read from 'tightly coupled memory' (aka cache)
> > can be used in another instruction.
> > There is also a subtle pipeline stall if a read follows a write
> > to the same memory block because the write is executed one
> > clock later - and would collide with the read.
> > Since it only ever executes one instruction per clock loop
> > unrolling does help - since you never get the loop control 'for free'.
> > OTOH you don't need to use that many registers.
> > But an unrolled loop should approach 2 bytes/clock (32bit cpu).
> >
> > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > > result of the other versions:
> > >
> > > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
> >
> > What clock speed is that running at?
> > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
> >
> > While the small riscv cpu might be similar to the nios2 (and mips
> > for that matter), there are also bigger/faster cpu.
> > I'm sure these can execute multiple instructions/clock
> > and possible even read and write at the same time.
> > Unless they also support significant instruction re-ordering
> > the trivial copy loops are going to be slow on such cpu.
> >
>
> It's running at 1 GHz.
>
> I get 257 Mb/s with a memcpy, a bit more with a memset,
> but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
>

Err, I forgot to mlock() the memory before accessing it in userspace.

The real speed here is:

8 bit read: 155.42 Mb/s
64 bit read: 277.29 Mb/s
8 bit write: 138.57 Mb/s
64 bit write: 239.21 Mb/s

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2021-06-18  0:32             ` Matteo Croce
@ 2021-06-18  1:05               ` Matteo Croce
  -1 siblings, 0 replies; 64+ messages in thread
From: Matteo Croce @ 2021-06-18  1:05 UTC (permalink / raw)
  To: David Laight
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

On Fri, Jun 18, 2021 at 2:32 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
> <mcroce@linux.microsoft.com> wrote:
> >
> > On Thu, Jun 17, 2021 at 11:30 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Matteo Croce
> > > > Sent: 16 June 2021 19:52
> > > > To: Guo Ren <guoren@kernel.org>
> > > >
> > > > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <guoren@kernel.org> wrote:
> > > > >
> > > > > Hi Matteo,
> > > > >
> > > > > Have you tried Glibc generic implementation code?
> > > > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > > > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > > > >
> > > > > If Glibc codes have the same performance in your hardware, then you
> > > > > could give a generic implementation first.
> > >
> > > Isn't that a byte copy loop - the performance of that ought to be terrible.
> > > ...
> > >
> > > > I had a look, it seems that it's a C unrolled version with the
> > > > 'register' keyword.
> > > > The same one was already merged in nios2:
> > > > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
> > >
> > > I know a lot about the nios2 instruction timings.
> > > (I've looked at code execution in the fpga's intel 'logic analiser.)
> > > It is a very simple 4-clock pipeline cpu with a 2-clock delay
> > > before a value read from 'tightly coupled memory' (aka cache)
> > > can be used in another instruction.
> > > There is also a subtle pipeline stall if a read follows a write
> > > to the same memory block because the write is executed one
> > > clock later - and would collide with the read.
> > > Since it only ever executes one instruction per clock loop
> > > unrolling does help - since you never get the loop control 'for free'.
> > > OTOH you don't need to use that many registers.
> > > But an unrolled loop should approach 2 bytes/clock (32bit cpu).
> > >
> > > > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > > > result of the other versions:
> > > >
> > > > [  563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
> > >
> > > What clock speed is that running at?
> > > It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
> > >
> > > While the small riscv cpu might be similar to the nios2 (and mips
> > > for that matter), there are also bigger/faster cpu.
> > > I'm sure these can execute multiple instructions/clock
> > > and possible even read and write at the same time.
> > > Unless they also support significant instruction re-ordering
> > > the trivial copy loops are going to be slow on such cpu.
> > >
> >
> > It's running at 1 GHz.
> >
> > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
> >
>
> Err, I forget a mlock() before accessing the memory in userspace.
>
> The real speed here is:
>
> 8 bit read: 155.42 Mb/s
> 64 bit read: 277.29 Mb/s
> 8 bit write: 138.57 Mb/s
> 64 bit write: 239.21 Mb/s
>

Anyway, thanks for the info on nios2 timings.
If you think that an unrolled loop would help, we can achieve the same in C.
I think we could code something similar to Duff's device (or use jump
labels) to unroll the loop while still doing efficient small copies.
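
For illustration, a minimal sketch of such a Duff's-device-style unrolled
word copy in C (not the code proposed in this series; the helper name and
the unroll factor of 8 are made up):

#include <stddef.h>

static void word_copy_unrolled(unsigned long *d, const unsigned long *s,
			       size_t nwords)
{
	size_t rounds = (nwords + 7) / 8;

	if (!nwords)
		return;

	/* Jump into the middle of the unrolled loop based on nwords % 8. */
	switch (nwords % 8) {
	case 0: do {	*d++ = *s++;
	case 7:		*d++ = *s++;
	case 6:		*d++ = *s++;
	case 5:		*d++ = *s++;
	case 4:		*d++ = *s++;
	case 3:		*d++ = *s++;
	case 2:		*d++ = *s++;
	case 1:		*d++ = *s++;
		} while (--rounds > 0);
	}
}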

Regards,

--
per aspera ad upstream

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/3] riscv: optimized memcpy
  2021-06-18  1:05               ` Matteo Croce
@ 2021-06-18  8:32                 ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2021-06-18  8:32 UTC (permalink / raw)
  To: 'Matteo Croce'
  Cc: Guo Ren, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng

From: Matteo Croce
> Sent: 18 June 2021 02:05
...
> > > It's running at 1 GHz.
> > >
> > > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > > but I get 1200 Mb/s with a cyle which just reads memory with 64 bit addressing.
> > >
> >
> > Err, I forget a mlock() before accessing the memory in userspace.

What is the mlock() for?
The data for a quick loop won't get paged out.
You want to test cache to cache copies, so the first loop
will always be slow.
After that each iteration should be much the same.
I use code like:
	for (;;) {
		start = read_tsc();
		do_test();
		histogram[(read_tsc() - start) >> n]++;
	}
(You need to exclude outliers)
to get a distribution for the execution times.
Tends to be pretty stable - even though different program
runs can give different values!
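
A rough, self-contained version of that idea for RISC-V userspace might
look like the following (read_cycles(), do_test() and the bucket size are
made-up names and values, and rdcycle may first need to be made accessible
to user mode):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUCKETS		256
#define BUCKET_SHIFT	6	/* 64-cycle buckets */

static uint64_t histogram[BUCKETS];
static char src[4096], dst[4096];

static inline uint64_t read_cycles(void)
{
	uint64_t c;

	asm volatile ("rdcycle %0" : "=r" (c));
	return c;
}

static void do_test(void)
{
	memcpy(dst, src, sizeof(dst));	/* the routine under test */
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		uint64_t start = read_cycles();
		uint64_t d;

		do_test();
		d = (read_cycles() - start) >> BUCKET_SHIFT;
		histogram[d < BUCKETS ? d : BUCKETS - 1]++;
	}

	for (int i = 0; i < BUCKETS; i++)
		if (histogram[i])
			printf("%6d: %llu\n", i << BUCKET_SHIFT,
			       (unsigned long long)histogram[i]);
	return 0;
}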
	
> > The real speed here is:
> >
> > 8 bit read: 155.42 Mb/s
> > 64 bit read: 277.29 Mb/s
> > 8 bit write: 138.57 Mb/s
> > 64 bit write: 239.21 Mb/s
> >
> 
> Anyway, thanks for the info on nio2 timings.
> If you think that an unrolled loop would help, we can achieve the same in C.
> I think we could code something similar to a Duff device (or with jump
> labels) to unroll the loop but at the same time doing efficient small copies.

Unrolling has to be done with care.
It tends to improve benchmarks, but the extra code displaces
other code from the i-cache and slows down overall performance.
So you need 'just enough' unrolling to avoid cpu stalls.

On your system it looks like the memory/cache subsystem
is the bottleneck for the tests you are doing.
I'd really expect a 1GHz cpu to be able to read/write from
its data cache every clock.
So I'd expect transfer rates nearer 8000 MB/s, not 250 MB/s.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-31  5:25           ` Jisheng Zhang
@ 2024-01-31  9:13             ` Nick Kossifidis
  -1 siblings, 0 replies; 64+ messages in thread
From: Nick Kossifidis @ 2024-01-31  9:13 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On 1/31/24 07:25, Jisheng Zhang wrote:
> 
> I didn't have c99 spec in hand, but I found gcc explanations about
> restrict keyword from [1]:
> 
> "the restrict declaration promises that the code will not access that
> object in any other way--only through p."
> 
> So if there's overlap in memcpy, then it contradicts the restrict
> implication.
> 
> [1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html
> 
The union used in the code also contradicts this. BTW, the restrict
qualifier isn't used in the kernel's lib/string.c nor in the current
implementation
(https://elixir.bootlin.com/linux/latest/source/arch/riscv/include/asm/string.h#L16).

> And from the manual, if the memcpy users must ensure "The memory areas
> must not overlap." So I think all linux kernel's memcpy implementations(only copy
> fw and don't take overlap into consideration) are right.
> 
> I did see the alias-memcpy-as-memmove in some libc implementations, but
> this is not the style in current kernel's implementations.
> 
> Given current riscv asm implementation also doesn't do the alias and
> copy-fw only, and this series improves performance and doesn't introduce the
> Is it better to divide this into two steps: Firstly, merge this series
> if there's no obvious bug; secondly, do the alias as you suggested,
> since you have a basic implementation, you could even submit your patch
> ;) What do you think about this two steps solution?
> 

I still don't understand why you prefer undefined behavior over just
aliasing memcpy to memmove. Anyway, do as you wish; I don't have time to
work on this, unfortunately. Feel free to use the code I shared for the bw
copy etc.
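
For what it's worth, with the __alias() pattern this series already uses
in arch/riscv/lib/string.c, the aliasing could look roughly like this
(a sketch, not a tested patch; it assumes __memmove itself handles both
forward and backward copies):

#include <linux/export.h>
#include <linux/string.h>

void *__memmove(void *dest, const void *src, size_t count);

void *memmove(void *dest, const void *src, size_t count)
	__weak __alias(__memmove);
EXPORT_SYMBOL(memmove);

void *memcpy(void *dest, const void *src, size_t count)
	__weak __alias(__memmove);
EXPORT_SYMBOL(memcpy);

void *__memcpy(void *dest, const void *src, size_t count)
	__alias(__memmove);
EXPORT_SYMBOL(__memcpy);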

Regards,
Nick


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 16:52         ` Nick Kossifidis
@ 2024-01-31  5:25           ` Jisheng Zhang
  -1 siblings, 0 replies; 64+ messages in thread
From: Jisheng Zhang @ 2024-01-31  5:25 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Tue, Jan 30, 2024 at 06:52:24PM +0200, Nick Kossifidis wrote:
> On 1/30/24 15:12, Jisheng Zhang wrote:
> > On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
> > > On 1/28/24 13:10, Jisheng Zhang wrote:
> > > > From: Matteo Croce <mcroce@microsoft.com>
> > > > 
> > > > When the destination buffer is before the source one, or when the
> > > > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > > > optimized to use a bigger data size possible.
> > > > 
> > > > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> > > 
> > > I'd expect to have memmove handle both fw/bw copying and then memcpy being
> > > an alias to memmove, to also take care when regions overlap and avoid
> > > undefined behavior.
> > 
> > Hi Nick,
> > 
> > Here is somthing from man memcpy:
> > 
> > "void *memcpy(void dest[restrict .n], const void src[restrict .n],
> >                      size_t n);
> > 
> > The  memcpy()  function copies n bytes from memory area src to memory area dest.
> > The memory areas must not overlap.  Use memmove(3) if the memory areas do  over‐
> > lap."
> > 
> > IMHO, the "restrict" implies that there's no overlap. If overlap
> > happens, the manual doesn't say what will happen.
> > 
> >  From another side, I have a concern: currently, other arch don't have
> > this alias behavior, IIUC(at least, per my understanding of arm and arm64
> > memcpy implementations)they just copy forward. I want to keep similar behavior
> > for riscv.
> > 
> > So I want to hear more before going towards alias-memcpy-to-memmove direction.
> > 
> > Thanks
> 

Hi Nick,

> If you read Matteo's original post that was also his suggestion, and Linus

I did read all the discussions in Matteo's v1 ~ v5 before this respin. Per my
understanding, Matteo was also concerned that other arches' implementations
have no such memcpy-alias-memmove behavior.

> has also commented on that. In general it's better to handle the case where

Linus commented on https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132
about glibc aliasing memcpy to memmove, rather than on this patch series.

> the regions provided to memcpy() overlap than to resort to "undefined
> behavior", I provided a backwards copy example that you can use so that we
> can have both fw and bw copying for memmove(), and use memmove() in any
> case. The [restrict .n] in the prototype is just there to say that the size
> of src is restricted by n (the next argument). If someone uses memcpy() with

I don't have the C99 spec at hand, but I found GCC's explanation of the
restrict keyword in [1]:

"the restrict declaration promises that the code will not access that
object in any other way--only through p."

So if there's overlap in memcpy, then it contradicts the restrict
implication.

[1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html
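
As a small illustration of what that rules out (a hypothetical userspace
example, not kernel code):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[16] = "abcdefgh";

	/*
	 * Overlapping regions: undefined behaviour for memcpy(), since
	 * both restrict-qualified parameters end up aliasing buf.
	 */
	/* memcpy(buf + 1, buf, 8); */

	/* memmove() is specified to handle the overlap. */
	memmove(buf + 1, buf, 8);
	printf("%s\n", buf);	/* prints "aabcdefgh" */
	return 0;
}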

And from the manual, memcpy() users must ensure that "The memory areas
must not overlap." So I think all of the Linux kernel's memcpy implementations
(which only copy forward and don't take overlap into consideration) are right.

I did see the alias-memcpy-as-memmove approach in some libc implementations, but
this is not the style of the current kernel implementations.

Given that the current riscv asm implementation also doesn't do the alias and
only copies forward, and this series improves performance and doesn't introduce
the alias either: is it better to divide this into two steps? Firstly, merge this
series if there's no obvious bug; secondly, do the alias as you suggested -
since you have a basic implementation, you could even submit your patch ;)
What do you think about this two-step solution?

Thanks
> overlapping regions, which is always a possibility, in your case it'll
> result corrupted data, we won't even get a warning (still counts as
> undefined behavior) about it.
> 
> Regards,
> Nick
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 13:12       ` Jisheng Zhang
@ 2024-01-30 16:52         ` Nick Kossifidis
  -1 siblings, 0 replies; 64+ messages in thread
From: Nick Kossifidis @ 2024-01-30 16:52 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On 1/30/24 15:12, Jisheng Zhang wrote:
> On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
>> On 1/28/24 13:10, Jisheng Zhang wrote:
>>> From: Matteo Croce <mcroce@microsoft.com>
>>>
>>> When the destination buffer is before the source one, or when the
>>> buffers doesn't overlap, it's safe to use memcpy() instead, which is
>>> optimized to use a bigger data size possible.
>>>
>>> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>
>> I'd expect to have memmove handle both fw/bw copying and then memcpy being
>> an alias to memmove, to also take care when regions overlap and avoid
>> undefined behavior.
> 
> Hi Nick,
> 
> Here is somthing from man memcpy:
> 
> "void *memcpy(void dest[restrict .n], const void src[restrict .n],
>                      size_t n);
> 
> The  memcpy()  function copies n bytes from memory area src to memory area dest.
> The memory areas must not overlap.  Use memmove(3) if the memory areas do  over‐
> lap."
> 
> IMHO, the "restrict" implies that there's no overlap. If overlap
> happens, the manual doesn't say what will happen.
> 
>  From another side, I have a concern: currently, other arch don't have
> this alias behavior, IIUC(at least, per my understanding of arm and arm64
> memcpy implementations)they just copy forward. I want to keep similar behavior
> for riscv.
> 
> So I want to hear more before going towards alias-memcpy-to-memmove direction.
> 
> Thanks

If you read Matteo's original post, that was also his suggestion, and
Linus has also commented on it. In general it's better to handle the
case where the regions provided to memcpy() overlap than to resort to
"undefined behavior". I provided a backwards copy example that you can
use so that we can have both fw and bw copying in memmove(), and use
memmove() in any case. The [restrict .n] in the prototype is just there
to say that the size of src is restricted by n (the next argument). If
someone uses memcpy() with overlapping regions, which is always a
possibility, in your case it'll result in corrupted data, and we won't
even get a warning about it (it still counts as undefined behavior).

Regards,
Nick


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 11:39     ` Nick Kossifidis
@ 2024-01-30 13:12       ` Jisheng Zhang
  -1 siblings, 0 replies; 64+ messages in thread
From: Jisheng Zhang @ 2024-01-30 13:12 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
> On 1/28/24 13:10, Jisheng Zhang wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> > 
> > When the destination buffer is before the source one, or when the
> > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > optimized to use a bigger data size possible.
> > 
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> 
> I'd expect to have memmove handle both fw/bw copying and then memcpy being
> an alias to memmove, to also take care when regions overlap and avoid
> undefined behavior.

Hi Nick,

Here is something from man memcpy:

"void *memcpy(void dest[restrict .n], const void src[restrict .n],
                    size_t n);

The  memcpy()  function copies n bytes from memory area src to memory area dest.
The memory areas must not overlap.  Use memmove(3) if the memory areas do overlap."

IMHO, the "restrict" implies that there's no overlap. If overlap
happens, the manual doesn't say what will happen.

From another side, I have a concern: currently, other arches don't have
this alias behavior; IIUC (at least, per my understanding of the arm and arm64
memcpy implementations) they just copy forward. I want to keep similar behavior
for riscv.

So I want to hear more before going in the alias-memcpy-to-memmove direction.

Thanks
> 
> 
> > --- a/arch/riscv/lib/string.c
> > +++ b/arch/riscv/lib/string.c
> > @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
> >   EXPORT_SYMBOL(memcpy);
> >   void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> >   void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> > +
> > +/*
> > + * Simply check if the buffer overlaps an call memcpy() in case,
> > + * otherwise do a simple one byte at time backward copy.
> > + */
> > +void *__memmove(void *dest, const void *src, size_t count)
> > +{
> > +	if (dest < src || src + count <= dest)
> > +		return __memcpy(dest, src, count);
> > +
> > +	if (dest > src) {
> > +		const char *s = src + count;
> > +		char *tmp = dest + count;
> > +
> > +		while (count--)
> > +			*--tmp = *--s;
> > +	}
> > +	return dest;
> > +}
> > +EXPORT_SYMBOL(__memmove);
> > +
> 
> Here is an approach for the backwards case to get things started...
> 
> static void
> copy_bw(void *dst_ptr, const void *src_ptr, size_t len)
> {
> 	union const_data src = { .as_bytes = src_ptr + len };
> 	union data dst = { .as_bytes = dst_ptr + len };
> 	size_t remaining = len;
> 	size_t src_offt = 0;
> 
> 	if (len < 2 * WORD_SIZE)
> 		goto trailing_bw;
> 
> 	for(; dst.as_uptr & WORD_MASK; remaining--)
> 		*--dst.as_bytes = *--src.as_bytes;
> 
> 	src_offt = src.as_uptr & WORD_MASK;
> 	if (!src_offt) {
> 		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE)
> 			*--dst.as_ulong = *--src.as_ulong;
> 	} else {
> 		unsigned long cur, prev;
> 		src.as_bytes -= src_offt;
> 		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
> 			cur = *src.as_ulong;
> 			prev = *--src.as_ulong;
> 			*--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) |
> 					  prev >> (src_offt * 8);
> 		}
> 		src.as_bytes += src_offt;
> 	}
> 
>  trailing_bw:
> 	while (remaining-- > 0)
> 		*--dst.as_bytes = *--src.as_bytes;
> }
> 
> Regards,
> Nick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 11:30       ` Jisheng Zhang
@ 2024-01-30 11:51         ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2024-01-30 11:51 UTC (permalink / raw)
  To: 'Jisheng Zhang'
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 30 January 2024 11:31
> 
> On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote:
> > From: Jisheng Zhang
> > > Sent: 28 January 2024 11:10
> > >
> > > When the destination buffer is before the source one, or when the
> > > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > > optimized to use a bigger data size possible.
> > >
> > ...
> > > + * Simply check if the buffer overlaps an call memcpy() in case,
> > > + * otherwise do a simple one byte at time backward copy.
> >
> > I'd at least do a 64bit copy loop if the addresses are aligned.
> >
> > Thinks a bit more....
> >
> > Put the copy 64 bytes code (the body of the memcpy() loop)
> > into it an inline function and call it with increasing addresses
> > in memcpy() are decrementing addresses in memmove.
> 
> Hi David,
> 
> Besides the 64 bytes copy, there's another optimization in __memcpy:
> word-by-word copy even if s and d are not aligned.
> So if we make the two optimizd copy as inline functions and call them
> in memmove(), we almost duplicate the __memcpy code, so I think
> directly calling __memcpy is a bit better.

If a forwards copy is valid, call memcpy() - which I think you do.
If not, you can still use the same 'copy 8 registers' code
that memcpy() uses - just with a decrementing block address.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10   ` Jisheng Zhang
@ 2024-01-30 11:39     ` Nick Kossifidis
  -1 siblings, 0 replies; 64+ messages in thread
From: Nick Kossifidis @ 2024-01-30 11:39 UTC (permalink / raw)
  To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

On 1/28/24 13:10, Jisheng Zhang wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> When the destination buffer is before the source one, or when the
> buffers doesn't overlap, it's safe to use memcpy() instead, which is
> optimized to use a bigger data size possible.
> 
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>

I'd expect to have memmove handle both fw/bw copying and then have memcpy
be an alias to memmove, to also take care of the case when regions overlap
and avoid undefined behavior.


> --- a/arch/riscv/lib/string.c
> +++ b/arch/riscv/lib/string.c
> @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
>   EXPORT_SYMBOL(memcpy);
>   void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
>   void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> +
> +/*
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.
> + */
> +void *__memmove(void *dest, const void *src, size_t count)
> +{
> +	if (dest < src || src + count <= dest)
> +		return __memcpy(dest, src, count);
> +
> +	if (dest > src) {
> +		const char *s = src + count;
> +		char *tmp = dest + count;
> +
> +		while (count--)
> +			*--tmp = *--s;
> +	}
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memmove);
> +

Here is an approach for the backwards case to get things started...

static void
copy_bw(void *dst_ptr, const void *src_ptr, size_t len)
{
	union const_data src = { .as_bytes = src_ptr + len };
	union data dst = { .as_bytes = dst_ptr + len };
	size_t remaining = len;
	size_t src_offt = 0;

	if (len < 2 * WORD_SIZE)
		goto trailing_bw;

	for(; dst.as_uptr & WORD_MASK; remaining--)
		*--dst.as_bytes = *--src.as_bytes;

	src_offt = src.as_uptr & WORD_MASK;
	if (!src_offt) {
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE)
			*--dst.as_ulong = *--src.as_ulong;
	} else {
		unsigned long cur, prev;
		src.as_bytes -= src_offt;
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
			cur = *src.as_ulong;
			prev = *--src.as_ulong;
			*--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) |
					  prev >> (src_offt * 8);
		}
		src.as_bytes += src_offt;
	}

  trailing_bw:
	while (remaining-- > 0)
		*--dst.as_bytes = *--src.as_bytes;
}
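
(The WORD_SIZE/WORD_MASK macros and the data unions aren't shown above;
definitions along these lines would make the sketch build - a guess
consistent with the snippet rather than taken from it:)

#include <linux/types.h>

#define WORD_SIZE	sizeof(unsigned long)
#define WORD_MASK	(WORD_SIZE - 1)

union data {
	u8 *as_bytes;
	unsigned long *as_ulong;
	uintptr_t as_uptr;
};

union const_data {
	const u8 *as_bytes;
	const unsigned long *as_ulong;
	uintptr_t as_uptr;
};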

Regards,
Nick

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 12:47     ` David Laight
@ 2024-01-30 11:30       ` Jisheng Zhang
  -1 siblings, 0 replies; 64+ messages in thread
From: Jisheng Zhang @ 2024-01-30 11:30 UTC (permalink / raw)
  To: David Laight
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote:
> From: Jisheng Zhang
> > Sent: 28 January 2024 11:10
> > 
> > When the destination buffer is before the source one, or when the
> > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > optimized to use a bigger data size possible.
> > 
> ...
> > + * Simply check if the buffer overlaps an call memcpy() in case,
> > + * otherwise do a simple one byte at time backward copy.
> 
> I'd at least do a 64bit copy loop if the addresses are aligned.
> 
> Thinks a bit more....
> 
> Put the copy 64 bytes code (the body of the memcpy() loop)
> into it an inline function and call it with increasing addresses
> in memcpy() are decrementing addresses in memmove.

Hi David,

Besides the 64-byte copy, there's another optimization in __memcpy:
word-by-word copy even if s and d are not mutually aligned.
So if we make the two optimized copies inline functions and call them
in memmove(), we almost duplicate the __memcpy code, so I think
directly calling __memcpy is a bit better.
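
For reference, the shift-and-merge idea behind that word-by-word copy
looks roughly like this (a little-endian sketch of the misaligned case
only, not the actual __memcpy from this series; it assumes src is not
word-aligned, leaves any trailing partial word to the caller, and the
final aligned load stays within the word holding the last source byte):

#include <stddef.h>
#include <stdint.h>

static void copy_words_misaligned(unsigned long *d, const unsigned char *s,
				  size_t nwords)
{
	size_t offt = (uintptr_t)s & (sizeof(long) - 1);	/* 1..7 */
	const unsigned long *sw = (const unsigned long *)(s - offt);
	unsigned long cur, next;

	cur = *sw++;
	while (nwords--) {
		next = *sw++;
		/* Merge the tail of 'cur' with the head of 'next'. */
		*d++ = (cur >> (offt * 8)) |
		       (next << ((sizeof(long) - offt) * 8));
		cur = next;
	}
}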

Thanks
> 
> So memcpy() contains:
> 	src_lim = src_lim + count;
> 	... alignment copy
> 	for (; src + 64 <= src_lim; src += 64; dest += 64)
> 		copy_64_bytes(dest, src);
> 	... tail copy
> 
> Then you can do something very similar for backwards copies.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10   ` Jisheng Zhang
@ 2024-01-28 12:47     ` David Laight
  -1 siblings, 0 replies; 64+ messages in thread
From: David Laight @ 2024-01-28 12:47 UTC (permalink / raw)
  To: 'Jisheng Zhang', Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 28 January 2024 11:10
> 
> When the destination buffer is before the source one, or when the
> buffers doesn't overlap, it's safe to use memcpy() instead, which is
> optimized to use a bigger data size possible.
> 
...
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.

I'd at least do a 64bit copy loop if the addresses are aligned.

Thinking a bit more....

Put the copy-64-bytes code (the body of the memcpy() loop)
into an inline function and call it with increasing addresses
in memcpy() and decrementing addresses in memmove().

So memcpy() contains:
	src_lim = src + count;
	... alignment copy
	for (; src + 64 <= src_lim; src += 64, dest += 64)
		copy_64_bytes(dest, src);
	... tail copy

Then you can do something very similar for backwards copies.
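
Something along these lines, perhaps (a sketch assuming 64-bit registers
and word-aligned, whole-block sizes; copy_64_bytes() and the helpers are
made-up names, and alignment/tail handling is omitted):

#include <stddef.h>

static inline void copy_64_bytes(unsigned long *dest, const unsigned long *src)
{
	/* Eight register-wide loads, then eight stores. */
	unsigned long a = src[0], b = src[1], c = src[2], d = src[3];
	unsigned long e = src[4], f = src[5], g = src[6], h = src[7];

	dest[0] = a; dest[1] = b; dest[2] = c; dest[3] = d;
	dest[4] = e; dest[5] = f; dest[6] = g; dest[7] = h;
}

/* Forward bulk loop, as in memcpy(). */
static void bulk_copy_fw(unsigned long *dest, const unsigned long *src,
			 size_t nwords)
{
	const unsigned long *src_lim = src + nwords;

	for (; src + 8 <= src_lim; src += 8, dest += 8)
		copy_64_bytes(dest, src);
}

/* Backward bulk loop, for the overlapping memmove() case. */
static void bulk_copy_bw(unsigned long *dest, const unsigned long *src,
			 size_t nwords)
{
	const unsigned long *s = src + nwords;
	unsigned long *d = dest + nwords;

	while (s - 8 >= src) {
		s -= 8;
		d -= 8;
		copy_64_bytes(d, s);
	}
}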

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
@ 2024-01-28 11:10   ` Jisheng Zhang
  0 siblings, 0 replies; 64+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
 arch/riscv/include/asm/string.h |   4 +-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memmove.S        | 317 --------------------------------
 arch/riscv/lib/string.c         |  25 +++
 5 files changed, 27 insertions(+), 322 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index edf1d56e4f13..17c3b40382e1 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -18,8 +18,8 @@ extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
 
 #define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 
 #define __HAVE_ARCH_STRCMP
 extern asmlinkage int strcmp(const char *cs, const char *ct);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index c69dc74e0a27..76849d0906ef 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,9 +10,7 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 5f2f94f6db17..5fa88c5a601c 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
 lib-y			+= string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index cb3e2e7ef0ba..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,317 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-SYM_FUNC_START(__memmove)
-	/*
-	 * Returns
-	 *   a0 - dest
-	 *
-	 * Parameters
-	 *   a0 - Inclusive first byte of dest
-	 *   a1 - Inclusive first byte of src
-	 *   a2 - Length of copy n
-	 *
-	 * Because the return matches the parameter register a0,
-	 * we will not clobber or modify that register.
-	 *
-	 * Note: This currently only works on little-endian.
-	 * To port to big-endian, reverse the direction of shifts
-	 * in the 2 misaligned fixup copy loops.
-	 */
-
-	/* Return if nothing to do */
-	beq a0, a1, .Lreturn_from_memmove
-	beqz a2, .Lreturn_from_memmove
-
-	/*
-	 * Register Uses
-	 *      Forward Copy: a1 - Index counter of src
-	 *      Reverse Copy: a4 - Index counter of src
-	 *      Forward Copy: t3 - Index counter of dest
-	 *      Reverse Copy: t4 - Index counter of dest
-	 *   Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest
-	 *   Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest
-	 *   Both Copy Modes: t0 - Link / Temporary for load-store
-	 *   Both Copy Modes: t1 - Temporary for load-store
-	 *   Both Copy Modes: t2 - Temporary for load-store
-	 *   Both Copy Modes: a5 - dest to src alignment offset
-	 *   Both Copy Modes: a6 - Shift ammount
-	 *   Both Copy Modes: a7 - Inverse Shift ammount
-	 *   Both Copy Modes: a2 - Alternate breakpoint for unrolled loops
-	 */
-
-	/*
-	 * Solve for some register values now.
-	 * Byte copy does not need t5 or t6.
-	 */
-	mv   t3, a0
-	add  t4, a0, a2
-	add  a4, a1, a2
-
-	/*
-	 * Byte copy if copying less than (2 * SZREG) bytes. This can
-	 * cause problems with the bulk copy implementation and is
-	 * small enough not to bother.
-	 */
-	andi t0, a2, -(2 * SZREG)
-	beqz t0, .Lbyte_copy
-
-	/*
-	 * Now solve for t5 and t6.
-	 */
-	andi t5, t3, -SZREG
-	andi t6, t4, -SZREG
-	/*
-	 * If dest(Register t3) rounded down to the nearest naturally
-	 * aligned SZREG address, does not equal dest, then add SZREG
-	 * to find the low-bound of SZREG alignment in the dest memory
-	 * region.  Note that this could overshoot the dest memory
-	 * region if n is less than SZREG.  This is one reason why
-	 * we always byte copy if n is less than SZREG.
-	 * Otherwise, dest is already naturally aligned to SZREG.
-	 */
-	beq  t5, t3, 1f
-		addi t5, t5, SZREG
-	1:
-
-	/*
-	 * If the dest and src are co-aligned to SZREG, then there is
-	 * no need for the full rigmarole of a full misaligned fixup copy.
-	 * Instead, do a simpler co-aligned copy.
-	 */
-	xor  t0, a0, a1
-	andi t1, t0, (SZREG - 1)
-	beqz t1, .Lcoaligned_copy
-	/* Fall through to misaligned fixup copy */
-
-.Lmisaligned_fixup_copy:
-	bltu a1, a0, .Lmisaligned_fixup_copy_reverse
-
-.Lmisaligned_fixup_copy_forward:
-	jal  t0, .Lbyte_copy_until_aligned_forward
-
-	andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */
-	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
-	sub  a5, a1, t3 /* Find the difference between src and dest */
-	andi a1, a1, -SZREG /* Align the src pointer */
-	addi a2, t6, SZREG /* The other breakpoint for the unrolled loop*/
-
-	/*
-	 * Compute The Inverse Shift
-	 * a7 = XLEN - a6 = XLEN + -a6
-	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
-	 * Add that to XLEN.  XLEN = SZREG * 8.
-	 */
-	not  a7, a6
-	addi a7, a7, (SZREG * 8 + 1)
-
-	/*
-	 * Fix Misalignment Copy Loop - Forward
-	 * load_val0 = load_ptr[0];
-	 * do {
-	 * 	load_val1 = load_ptr[1];
-	 * 	store_ptr += 2;
-	 * 	store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7});
-	 *
-	 * 	if (store_ptr == {a2})
-	 * 		break;
-	 *
-	 * 	load_val0 = load_ptr[2];
-	 * 	load_ptr += 2;
-	 * 	store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7});
-	 *
-	 * } while (store_ptr != store_ptr_end);
-	 * store_ptr = store_ptr_end;
-	 */
-
-	REG_L t0, (0 * SZREG)(a1)
-	1:
-	REG_L t1, (1 * SZREG)(a1)
-	addi  t3, t3, (2 * SZREG)
-	srl   t0, t0, a6
-	sll   t2, t1, a7
-	or    t2, t0, t2
-	REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3)
-
-	beq   t3, a2, 2f
-
-	REG_L t0, (2 * SZREG)(a1)
-	addi  a1, a1, (2 * SZREG)
-	srl   t1, t1, a6
-	sll   t2, t0, a7
-	or    t2, t1, t2
-	REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3)
-
-	bne   t3, t6, 1b
-	2:
-	mv    t3, t6 /* Fix the dest pointer in case the loop was broken */
-
-	add  a1, t3, a5 /* Restore the src pointer */
-	j .Lbyte_copy_forward /* Copy any remaining bytes */
-
-.Lmisaligned_fixup_copy_reverse:
-	jal  t0, .Lbyte_copy_until_aligned_reverse
-
-	andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */
-	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
-	sub  a5, a4, t4 /* Find the difference between src and dest */
-	andi a4, a4, -SZREG /* Align the src pointer */
-	addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop*/
-
-	/*
-	 * Compute The Inverse Shift
-	 * a7 = XLEN - a6 = XLEN + -a6
-	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
-	 * Add that to XLEN.  XLEN = SZREG * 8.
-	 */
-	not  a7, a6
-	addi a7, a7, (SZREG * 8 + 1)
-
-	/*
-	 * Fix Misalignment Copy Loop - Reverse
-	 * load_val1 = load_ptr[0];
-	 * do {
-	 * 	load_val0 = load_ptr[-1];
-	 * 	store_ptr -= 2;
-	 * 	store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7});
-	 *
-	 * 	if (store_ptr == {a2})
-	 * 		break;
-	 *
-	 * 	load_val1 = load_ptr[-2];
-	 * 	load_ptr -= 2;
-	 * 	store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7});
-	 *
-	 * } while (store_ptr != store_ptr_end);
-	 * store_ptr = store_ptr_end;
-	 */
-
-	REG_L t1, ( 0 * SZREG)(a4)
-	1:
-	REG_L t0, (-1 * SZREG)(a4)
-	addi  t4, t4, (-2 * SZREG)
-	sll   t1, t1, a7
-	srl   t2, t0, a6
-	or    t2, t1, t2
-	REG_S t2, ( 1 * SZREG)(t4)
-
-	beq   t4, a2, 2f
-
-	REG_L t1, (-2 * SZREG)(a4)
-	addi  a4, a4, (-2 * SZREG)
-	sll   t0, t0, a7
-	srl   t2, t1, a6
-	or    t2, t0, t2
-	REG_S t2, ( 0 * SZREG)(t4)
-
-	bne   t4, t5, 1b
-	2:
-	mv    t4, t5 /* Fix the dest pointer in case the loop was broken */
-
-	add  a4, t4, a5 /* Restore the src pointer */
-	j .Lbyte_copy_reverse /* Copy any remaining bytes */
-
-/*
- * Simple copy loops for SZREG co-aligned memory locations.
- * These also make calls to do byte copies for any unaligned
- * data at their terminations.
- */
-.Lcoaligned_copy:
-	bltu a1, a0, .Lcoaligned_copy_reverse
-
-.Lcoaligned_copy_forward:
-	jal t0, .Lbyte_copy_until_aligned_forward
-
-	1:
-	REG_L t1, ( 0 * SZREG)(a1)
-	addi  a1, a1, SZREG
-	addi  t3, t3, SZREG
-	REG_S t1, (-1 * SZREG)(t3)
-	bne   t3, t6, 1b
-
-	j .Lbyte_copy_forward /* Copy any remaining bytes */
-
-.Lcoaligned_copy_reverse:
-	jal t0, .Lbyte_copy_until_aligned_reverse
-
-	1:
-	REG_L t1, (-1 * SZREG)(a4)
-	addi  a4, a4, -SZREG
-	addi  t4, t4, -SZREG
-	REG_S t1, ( 0 * SZREG)(t4)
-	bne   t4, t5, 1b
-
-	j .Lbyte_copy_reverse /* Copy any remaining bytes */
-
-/*
- * These are basically sub-functions within the function.  They
- * are used to byte copy until the dest pointer is in alignment.
- * At which point, a bulk copy method can be used by the
- * calling code.  These work on the same registers as the bulk
- * copy loops.  Therefore, the register values can be picked
- * up from where they were left and we avoid code duplication
- * without any overhead except the call in and return jumps.
- */
-.Lbyte_copy_until_aligned_forward:
-	beq  t3, t5, 2f
-	1:
-	lb   t1,  0(a1)
-	addi a1, a1, 1
-	addi t3, t3, 1
-	sb   t1, -1(t3)
-	bne  t3, t5, 1b
-	2:
-	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
-
-.Lbyte_copy_until_aligned_reverse:
-	beq  t4, t6, 2f
-	1:
-	lb   t1, -1(a4)
-	addi a4, a4, -1
-	addi t4, t4, -1
-	sb   t1,  0(t4)
-	bne  t4, t6, 1b
-	2:
-	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
-
-/*
- * Simple byte copy loops.
- * These will byte copy until they reach the end of data to copy.
- * At that point, they will call to return from memmove.
- */
-.Lbyte_copy:
-	bltu a1, a0, .Lbyte_copy_reverse
-
-.Lbyte_copy_forward:
-	beq  t3, t4, 2f
-	1:
-	lb   t1,  0(a1)
-	addi a1, a1, 1
-	addi t3, t3, 1
-	sb   t1, -1(t3)
-	bne  t3, t4, 1b
-	2:
-	ret
-
-.Lbyte_copy_reverse:
-	beq  t4, t3, 2f
-	1:
-	lb   t1, -1(a4)
-	addi a4, a4, -1
-	addi t4, t4, -1
-	sb   t1,  0(t4)
-	bne  t4, t3, 1b
-	2:
-
-.Lreturn_from_memmove:
-	ret
-
-SYM_FUNC_END(__memmove)
-SYM_FUNC_ALIAS_WEAK(memmove, __memmove)
-SYM_FUNC_ALIAS(__pi_memmove, __memmove)
-SYM_FUNC_ALIAS(__pi___memmove, __memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 5f9c83ec548d..20677c8067da 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
 EXPORT_SYMBOL(memcpy);
 void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
 void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
+
+/*
+ * Simply check whether the buffers overlap and call memcpy() when a
+ * forward copy is safe, otherwise do a one-byte-at-a-time backward copy.
+ */
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return __memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(__memmove);
+
+void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
+EXPORT_SYMBOL(memmove);
+void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove);
+void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread
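
The overlap check used by the __memmove() added above can be exercised in
userspace with a minimal standalone sketch. The sketch_memmove() helper and
the test values below are hypothetical and not part of the patch; they only
illustrate the "forward memcpy() when safe, byte-wise backward copy
otherwise" strategy.

/*
 * Minimal userspace sketch of the same strategy as the patch's
 * __memmove(): forward memcpy() when safe, backward byte copy otherwise.
 */
#include <stdio.h>
#include <string.h>

static void *sketch_memmove(void *dest, const void *src, size_t count)
{
	const char *s = src;
	char *d = dest;

	/* dest below src, or no overlap at all: a forward copy is safe */
	if (d < s || s + count <= d)
		return memcpy(dest, src, count);

	/* dest overlaps the tail of src: copy backwards, byte by byte */
	s += count;
	d += count;
	while (count--)
		*--d = *--s;

	return dest;
}

int main(void)
{
	char buf[] = "0123456789";

	/* overlapping move: shift the first 8 bytes up by two */
	sketch_memmove(buf + 2, buf, 8);
	printf("%s\n", buf);	/* prints "0101234567" */
	return 0;
}

On the same inputs, the C library's memmove() produces the same
"0101234567" result, which is a quick way to sanity-check the overlap
condition.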


end of thread, other threads:[~2024-01-31 11:31 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
2021-06-15  2:38 ` Matteo Croce
2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
2021-06-15  2:38   ` Matteo Croce
2021-06-15  8:57   ` David Laight
2021-06-15  8:57     ` David Laight
2021-06-15 13:08     ` Bin Meng
2021-06-15 13:08       ` Bin Meng
2021-06-15 13:18       ` David Laight
2021-06-15 13:18         ` David Laight
2021-06-15 13:28         ` Bin Meng
2021-06-15 13:28           ` Bin Meng
2021-06-15 16:12           ` Emil Renner Berthing
2021-06-15 16:12             ` Emil Renner Berthing
2021-06-16  0:33             ` Bin Meng
2021-06-16  0:33               ` Bin Meng
2021-06-16  2:01               ` Matteo Croce
2021-06-16  2:01                 ` Matteo Croce
2021-06-16  8:24                 ` David Laight
2021-06-16  8:24                   ` David Laight
2021-06-16 10:48                   ` Akira Tsukamoto
2021-06-16 10:48                     ` Akira Tsukamoto
2021-06-16 19:06                   ` Matteo Croce
2021-06-16 19:06                     ` Matteo Croce
2021-06-15 13:44         ` Matteo Croce
2021-06-15 13:44           ` Matteo Croce
2021-06-16 11:46   ` Guo Ren
2021-06-16 11:46     ` Guo Ren
2021-06-16 18:52     ` Matteo Croce
2021-06-16 18:52       ` Matteo Croce
2021-06-17 21:30       ` David Laight
2021-06-17 21:30         ` David Laight
2021-06-17 21:48         ` Matteo Croce
2021-06-17 21:48           ` Matteo Croce
2021-06-18  0:32           ` Matteo Croce
2021-06-18  0:32             ` Matteo Croce
2021-06-18  1:05             ` Matteo Croce
2021-06-18  1:05               ` Matteo Croce
2021-06-18  8:32               ` David Laight
2021-06-18  8:32                 ` David Laight
2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
2021-06-15  2:38   ` Matteo Croce
2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
2021-06-15  2:38   ` Matteo Croce
2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
2021-06-15  2:43   ` Bin Meng
2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
2024-01-28 11:10   ` Jisheng Zhang
2024-01-28 12:47   ` David Laight
2024-01-28 12:47     ` David Laight
2024-01-30 11:30     ` Jisheng Zhang
2024-01-30 11:30       ` Jisheng Zhang
2024-01-30 11:51       ` David Laight
2024-01-30 11:51         ` David Laight
2024-01-30 11:39   ` Nick Kossifidis
2024-01-30 11:39     ` Nick Kossifidis
2024-01-30 13:12     ` Jisheng Zhang
2024-01-30 13:12       ` Jisheng Zhang
2024-01-30 16:52       ` Nick Kossifidis
2024-01-30 16:52         ` Nick Kossifidis
2024-01-31  5:25         ` Jisheng Zhang
2024-01-31  5:25           ` Jisheng Zhang
2024-01-31  9:13           ` Nick Kossifidis
2024-01-31  9:13             ` Nick Kossifidis
