* [PATCH v3 0/3] riscv: optimized mem* functions
@ 2021-06-17 15:27 ` Matteo Croce
From: Matteo Croce @ 2021-06-17 15:27 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren

From: Matteo Croce <mcroce@microsoft.com>

Replace the assembly mem{cpy,move,set} with C equivalents.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.

v2 -> v3:
- alias mem* to __mem* and not vice versa
- use __alias instead of a tail call

v1 -> v2:
- reduce the threshold from 64 to 16 bytes
- fix KASAN build
- optimize memset

Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  18 ++--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  17 ----
 arch/riscv/lib/Makefile         |   4 +-
 arch/riscv/lib/memcpy.S         | 108 ----------------------
 arch/riscv/lib/memmove.S        |  64 -------------
 arch/riscv/lib/memset.S         | 113 -----------------------
 arch/riscv/lib/string.c         | 153 ++++++++++++++++++++++++++++++++
 8 files changed, 163 insertions(+), 315 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.31.1


* [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-17 15:27 ` Matteo Croce
@ 2021-06-17 15:27   ` Matteo Croce
From: Matteo Croce @ 2021-06-17 15:27 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is
aligned to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose a long at every cycle.
Finally, copy the remainder one byte at a time.
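
As an illustration only (not part of the patch), this is the idea behind
the shifting step, assuming a 64-bit little-endian machine and a
source/destination misalignment of 'distance' bytes (0 < distance < 8),
using the same naming as the code below:

  /*
   * Illustration: compose one aligned destination long from two aligned
   * source reads. Here 's' points to the long-aligned address just below
   * the real (misaligned) source and 'd' to the aligned destination.
   */
  unsigned long last = s[0];	/* aligned long holding the first src bytes */
  unsigned long next = s[1];	/* the following aligned long */

  d[0] = last >> (distance * 8) | next << ((8 - distance) * 8);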

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

The reduced CPU time spent in memcpy() is also observable with perf top.
This is the `perf top -Ue task-clock` output while running the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |   8 ++-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   2 +-
 arch/riscv/lib/memcpy.S         | 108 --------------------------------
 arch/riscv/lib/string.c         |  91 +++++++++++++++++++++++++++
 5 files changed, 98 insertions(+), 113 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
 
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..e48a79a045d8
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* Minimum size for a word copy to be convenient */
+#define MIN_THRESHOLD (BITS_PER_LONG / 8 * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *u8;
+	unsigned long *ulong;
+	uintptr_t uptr;
+};
+
+union const_types {
+	const u8 *u8;
+	unsigned long *ulong;
+};
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	const int bytes_long = BITS_PER_LONG / 8;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	const int mask = bytes_long - 1;
+	const int distance = (src - dest) & mask;
+#endif
+	union const_types s = { .u8 = src };
+	union types d = { .u8 = dest };
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (count < MIN_THRESHOLD)
+		goto copy_remainder;
+
+	/* copy a byte at time until destination is aligned */
+	for (; count && d.uptr & mask; count--)
+		*d.u8++ = *s.u8++;
+
+	if (distance) {
+		unsigned long last, next;
+
+		/* move s backward to the previous alignment boundary */
+		s.u8 -= distance;
+
+		/* 32/64 bit wide copy from s to d.
+		 * d is aligned now but s is not, so read s alignment wise,
+		 * and do proper shift to get the right value.
+		 * Works only on Little Endian machines.
+		 */
+		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+			last = next;
+			next = s.ulong[1];
+
+			d.ulong[0] = last >> (distance * 8) |
+				     next << ((bytes_long - distance) * 8);
+
+			d.ulong++;
+			s.ulong++;
+		}
+
+		/* restore s with the original offset */
+		s.u8 += distance;
+	} else
+#endif
+	{
+		/* if the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= bytes_long; count -= bytes_long)
+			*d.ulong++ = *s.ulong++;
+	}
+
+	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
+	goto copy_remainder;
+
+copy_remainder:
+	while (count--)
+		*d.u8++ = *s.u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(__memcpy);
+
+void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
+EXPORT_SYMBOL(memcpy);
-- 
2.31.1


* [PATCH v3 2/3] riscv: optimized memmove
  2021-06-17 15:27 ` Matteo Croce
@ 2021-06-17 15:27   ` Matteo Croce
From: Matteo Croce @ 2021-06-17 15:27 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.
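
As an illustration only (not part of the patch), a call which cannot be
handed off to memcpy() and takes the backward byte copy instead:

  char buf[9] = "abcdefgh";

  /* dest > src and src + count > dest: the buffers overlap, so
   * __memmove() copies the 7 bytes backward, one at a time.
   */
  memmove(buf + 1, buf, 7);	/* buf is now "aabcdefg" */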

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  6 ++--
 arch/riscv/kernel/riscv_ksyms.c |  2 --
 arch/riscv/lib/Makefile         |  1 -
 arch/riscv/lib/memmove.S        | 64 ---------------------------------
 arch/riscv/lib/string.c         | 23 ++++++++++++
 5 files changed, 26 insertions(+), 70 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 6b5d6fc3eab4..25d9b9078569 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -17,11 +17,11 @@ extern asmlinkage void *__memset(void *, int, size_t);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
+#define __HAVE_ARCH_MEMMOVE
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 #endif
 
-#define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
 /* For those files which don't want to check by kasan. */
 #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 3f6d512a5b97..361565c4db7e 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,6 +10,4 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 2ffe85d4baee..484f5ff7b508 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index 07d1d2152ba5..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index e48a79a045d8..9c7009d43c39 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -89,3 +89,26 @@ EXPORT_SYMBOL(__memcpy);
 
 void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
 EXPORT_SYMBOL(memcpy);
+
+/*
+ * Check whether the buffers overlap: when a forward copy is safe, call
+ * memcpy(); otherwise do a simple one byte at a time backward copy.
+ */
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(__memmove);
+
+void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
+EXPORT_SYMBOL(memmove);
-- 
2.31.1


* [PATCH v3 3/3] riscv: optimized memset
  2021-06-17 15:27 ` Matteo Croce
@ 2021-06-17 15:27   ` Matteo Croce
From: Matteo Croce @ 2021-06-17 15:27 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte at a time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed, and
finally fills the remaining data one byte at a time.
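
As an illustration only (not part of the patch), this is how the fill
byte is broadcast into a full long before the word-sized stores, e.g.
for c = 0xAB on a 64-bit machine:

  unsigned long cu = 0xAB;

  cu |= cu << 8;	/* 0x000000000000ABAB */
  cu |= cu << 16;	/* 0x00000000ABABABAB */
  cu |= cu << 32;	/* 0xABABABABABABABAB */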

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  10 +--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  13 ----
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memset.S         | 113 --------------------------------
 arch/riscv/lib/string.c         |  39 +++++++++++
 6 files changed, 42 insertions(+), 135 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memset.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 25d9b9078569..90500635035a 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -6,14 +6,10 @@
 #ifndef _ASM_RISCV_STRING_H
 #define _ASM_RISCV_STRING_H
 
-#include <linux/types.h>
-#include <linux/linkage.h>
-
-#define __HAVE_ARCH_MEMSET
-extern asmlinkage void *memset(void *, int, size_t);
-extern asmlinkage void *__memset(void *, int, size_t);
-
 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t count);
+extern void *__memset(void *s, int c, size_t count);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index d3081e4d9600..e635ce1e5645 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -31,7 +31,6 @@ obj-y	+= syscall_table.o
 obj-y	+= sys_riscv.o
 obj-y	+= time.o
 obj-y	+= traps.o
-obj-y	+= riscv_ksyms.o
 obj-y	+= stacktrace.o
 obj-y	+= cacheinfo.o
 obj-y	+= patch.o
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
deleted file mode 100644
index 361565c4db7e..000000000000
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ /dev/null
@@ -1,13 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2017 Zihao Yu
- */
-
-#include <linux/export.h>
-#include <linux/uaccess.h>
-
-/*
- * Assembly functions that may be used (directly or indirectly) by modules
- */
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(__memset);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 484f5ff7b508..e33263cc622a 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memset.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
deleted file mode 100644
index 34c5360c6705..000000000000
--- a/arch/riscv/lib/memset.S
+++ /dev/null
@@ -1,113 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memset(void *, int, size_t) */
-ENTRY(__memset)
-WEAK(memset)
-	move t0, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
-
-	/*
-	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
-	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f  /* Skip if already aligned */
-	/* Handle initial misalignment */
-	sub a4, a3, t0
-1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2: /* Duff's device with 32 XLEN stores per iteration */
-	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
-#ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
-#endif
-
-	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
-
-	andi a4, a4, 31*SZREG  /* Calculate remainder */
-	beqz a4, 3f            /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG  /* Calculate initial offset */
-
-	/* Adjust start address with offset */
-	sub t0, t0, a4
-
-	/* Jump into loop body */
-	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
-#ifdef CONFIG_64BIT
-	srli a4, a4, 1
-#endif
-	add a5, a5, a4
-	jr a5
-3:
-	REG_S a1,        0(t0)
-	REG_S a1,    SZREG(t0)
-	REG_S a1,  2*SZREG(t0)
-	REG_S a1,  3*SZREG(t0)
-	REG_S a1,  4*SZREG(t0)
-	REG_S a1,  5*SZREG(t0)
-	REG_S a1,  6*SZREG(t0)
-	REG_S a1,  7*SZREG(t0)
-	REG_S a1,  8*SZREG(t0)
-	REG_S a1,  9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
-5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
-6:
-	ret
-END(__memset)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 9c7009d43c39..1fb4de351516 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -112,3 +112,42 @@ EXPORT_SYMBOL(__memmove);
 
 void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
 EXPORT_SYMBOL(memmove);
+
+void *__memset(void *s, int c, size_t count)
+{
+	union types dest = { .u8 = s };
+
+	if (count >= MIN_THRESHOLD) {
+		const int bytes_long = BITS_PER_LONG / 8;
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+		cu |= cu << 8;
+		cu |= cu << 16;
+#if BITS_PER_LONG == 64
+		cu |= cu << 32;
+#endif
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+		/* Fill the buffer one byte at time until the destination
+		 * is aligned on a 32/64 bit boundary.
+		 */
+		for (; count && dest.uptr % bytes_long; count--)
+			*dest.u8++ = c;
+#endif
+
+		/* Copy using the largest size allowed */
+		for (; count >= bytes_long; count -= bytes_long)
+			*dest.ulong++ = cu;
+	}
+
+	/* copy the remainder */
+	while (count--)
+		*dest.u8++ = c;
+
+	return s;
+}
+EXPORT_SYMBOL(__memset);
+
+void *memset(void *s, int c, size_t count) __weak __alias(__memset);
+EXPORT_SYMBOL(memset);
-- 
2.31.1


* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-18 14:06     ` kernel test robot
From: kernel test robot @ 2021-06-18 14:06 UTC (permalink / raw)
  To: Matteo Croce, linux-riscv
  Cc: kbuild-all, clang-built-linux, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini


Hi Matteo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.13-rc6 next-20210617]
[cannot apply to atish-riscv-linux/topo_v3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matteo-Croce/riscv-optimized-mem-functions/20210617-232934
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git dd860052c99b1e088352bdd4fb7aef46f8d2ef47
config: riscv-randconfig-r031-20210618 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 64720f57bea6a6bf033feef4a5751ab9c0c3b401)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/c35a3474b6782645ff0695db1f30be0e8123c131
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matteo-Croce/riscv-optimized-mem-functions/20210617-232934
        git checkout c35a3474b6782645ff0695db1f30be0e8123c131
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/riscv/lib/string.c:90:57: warning: attribute declaration must precede definition [-Wignored-attributes]
   void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
                                                           ^
   include/linux/compiler_attributes.h:291:56: note: expanded from macro '__weak'
   #define __weak                          __attribute__((__weak__))
                                                          ^
   include/linux/fortify-string.h:178:24: note: previous definition is here
   __FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size)
                          ^
   1 warning generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for LOCKDEP
   Depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT && (FRAME_POINTER || MIPS || PPC || S390 || MICROBLAZE || ARM || ARC || X86)
   Selected by
   - PROVE_LOCKING && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
   - LOCK_STAT && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
   - DEBUG_LOCK_ALLOC && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT


vim +90 arch/riscv/lib/string.c

    89	
  > 90	void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27385 bytes --]

        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matteo-Croce/riscv-optimized-mem-functions/20210617-232934
        git checkout c35a3474b6782645ff0695db1f30be0e8123c131
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/riscv/lib/string.c:90:57: warning: attribute declaration must precede definition [-Wignored-attributes]
   void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
                                                           ^
   include/linux/compiler_attributes.h:291:56: note: expanded from macro '__weak'
   #define __weak                          __attribute__((__weak__))
                                                          ^
   include/linux/fortify-string.h:178:24: note: previous definition is here
   __FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size)
                          ^
   1 warning generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for LOCKDEP
   Depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT && (FRAME_POINTER || MIPS || PPC || S390 || MICROBLAZE || ARM || ARC || X86)
   Selected by
   - PROVE_LOCKING && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
   - LOCK_STAT && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
   - DEBUG_LOCK_ALLOC && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT


vim +90 arch/riscv/lib/string.c

    89	
  > 90	void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 27385 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-21 14:26     ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:26 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On Thu, Jun 17, 2021 at 05:27:52PM +0200, Matteo Croce wrote:
> +extern void *memcpy(void *dest, const void *src, size_t count);
> +extern void *__memcpy(void *dest, const void *src, size_t count);

No need for externs.

> +++ b/arch/riscv/lib/string.c

Nothing in here looks RISC-V specific.  Why doesn't this go into lib/ so
that other architectures can use it as well?

> +#include <linux/module.h>

I think you only need export.h.

> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	const int mask = bytes_long - 1;
> +	const int distance = (src - dest) & mask;
> +#endif
> +	union const_types s = { .u8 = src };
> +	union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	if (count < MIN_THRESHOLD)

Using IS_ENABLED we can avoid a lot of the mess in this
function.

	int distance = 0;

	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
		if (count < MIN_THRESHOLD)
			goto copy_remainder;

		/* copy a byte at time until destination is aligned */
		for (; count && d.uptr & mask; count--)
			*d.u8++ = *s.u8++;
		distance = (src - dest) & mask;
	}

	if (distance) {
		...

> +		/* 32/64 bit wide copy from s to d.
> +		 * d is aligned now but s is not, so read s alignment wise,
> +		 * and do proper shift to get the right value.
> +		 * Works only on Little Endian machines.
> +		 */

Normal kernel comment style always starts with a:

		/*


> +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {

Please avoid the pointlessly overlong line.  And (just as a matter of
personal preference) I find for loops that don't actually use a single
iterator rather confusing.  Why not simply:

		next = s.ulong[0];
		while (count >= bytes_long + mask) {
			...
			count -= bytes_long;
		}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
@ 2021-06-21 14:26     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:26 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On Thu, Jun 17, 2021 at 05:27:52PM +0200, Matteo Croce wrote:
> +extern void *memcpy(void *dest, const void *src, size_t count);
> +extern void *__memcpy(void *dest, const void *src, size_t count);

No need for externs.

> +++ b/arch/riscv/lib/string.c

Nothing in here looks RISC-V specific.  Why doesn't this go into lib/ so
that other architectures can use it as well?

> +#include <linux/module.h>

I think you only need export.h.

> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	const int mask = bytes_long - 1;
> +	const int distance = (src - dest) & mask;
> +#endif
> +	union const_types s = { .u8 = src };
> +	union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	if (count < MIN_THRESHOLD)

Using IS_ENABLED we can avoid a lot of the mess in this
function.

	int distance = 0;

	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
		if (count < MIN_THRESHOLD)
			goto copy_remainder;

		/* copy a byte at time until destination is aligned */
		for (; count && d.uptr & mask; count--)
			*d.u8++ = *s.u8++;
		distance = (src - dest) & mask;
	}

	if (distance) {
		...

> +		/* 32/64 bit wide copy from s to d.
> +		 * d is aligned now but s is not, so read s alignment wise,
> +		 * and do proper shift to get the right value.
> +		 * Works only on Little Endian machines.
> +		 */

Normal kernel comment style always starts with a:

		/*


> +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {

Please avoid the pointlessly overlong line.  And (just as a matter of
personal preference) I find for loops that don't actually use a single
iterator rather confusing.  Why not simply:

		next = s.ulong[0];
		while (count >= bytes_long + mask) {
			...
			count -= bytes_long;
		}

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/3] riscv: optimized memmove
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-21 14:28     ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:28 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

This looks nice, but I think it would be a good candidate for lib/
again.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/3] riscv: optimized memmove
@ 2021-06-21 14:28     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:28 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

This looks nice, but I think it would be a good candidate for lib/
again.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-21 14:32     ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:32 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Looks nice, except IS_ENABLED would be useful here again, as well as
a placement in lib/.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
@ 2021-06-21 14:32     ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-06-21 14:32 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Looks nice, except IS_ENABLED would be useful here again, as well as
a placement in lib/.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-22  0:14     ` Nick Kossifidis
  -1 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  0:14 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Hello Matteo and thanks for the patch,

On 2021-06-17 18:27, Matteo Croce wrote:
> 
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * String functions optimized for hardware which doesn't
> + * handle unaligned memory accesses efficiently.
> + *
> + * Copyright (C) 2021 Matteo Croce
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +
> +/* Minimum size for a word copy to be convenient */
> +#define MIN_THRESHOLD (BITS_PER_LONG / 8 * 2)
> +
> +/* convenience union to avoid cast between different pointer types */
> +union types {
> +	u8 *u8;

You are using a type as a name; I'd go with as_bytes/as_ulong/as_uptr,
which makes it easier for the reader to understand what you are trying
to do.

> +	unsigned long *ulong;
> +	uintptr_t uptr;
> +};
> +
> +union const_types {
> +	const u8 *u8;
> +	unsigned long *ulong;
> +};
> +

I suggest you define those unions inside the function body; no one else
is using them.

> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	const int mask = bytes_long - 1;
> +	const int distance = (src - dest) & mask;

Why not unsigned ints?

> +#endif
> +	union const_types s = { .u8 = src };
> +	union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

If you want to be compliant with memcpy you should check for overlapping 
regions here since "The memory areas must not overlap", and do nothing 
about it because according to POSIX this leads to undefined behavior. 
That's why recent libc implementations use memmove in any case (memcpy 
is an alias to memmove), which is the suggested approach.
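
A minimal sketch of that approach, assuming __memmove is the
overlap-safe routine (the names here are only illustrative, but they
reuse the same __weak/__alias pattern the patch already has):

	void *__memmove(void *dest, const void *src, size_t count);

	/* both exported names resolve to the overlap-safe implementation */
	void *memcpy(void *dest, const void *src, size_t count)
		__weak __alias(__memmove);
	void *memmove(void *dest, const void *src, size_t count)
		__weak __alias(__memmove);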

> +	if (count < MIN_THRESHOLD)
> +		goto copy_remainder;
> +
> +	/* copy a byte at time until destination is aligned */
> +	for (; count && d.uptr & mask; count--)
> +		*d.u8++ = *s.u8++;
> +

You should check for !IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) here.

> +	if (distance) {
> +		unsigned long last, next;
> +
> +		/* move s backward to the previous alignment boundary */
> +		s.u8 -= distance;

It'd help here to explain that, since s is distance bytes ahead relative
to d, and d reached the alignment boundary above, s is now aligned, but
the data needs to be shifted to compensate for distance in order to do
the word-by-word copy.
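
As a concrete example on 64-bit little-endian (purely illustrative):
with distance = 3, once d has reached an 8-byte boundary, s.u8 - 3 is
also 8-byte aligned, and each destination word is built from two
aligned source words:

	/* bytes_long = 8, distance = 3: take the last 5 bytes of the
	 * previous aligned word and the first 3 bytes of the next one
	 */
	d.ulong[0] = last >> (3 * 8) |
		     next << ((8 - 3) * 8);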

> +
> +		/* 32/64 bit wide copy from s to d.
> +		 * d is aligned now but s is not, so read s alignment wise,
> +		 * and do proper shift to get the right value.
> +		 * Works only on Little Endian machines.
> +		 */

This comment is misleading, because s is aligned here; otherwise
s.ulong[0]/[1] below would result in an unaligned access.

> +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= 
> bytes_long) {
> +			last = next;
> +			next = s.ulong[1];
> +
> +			d.ulong[0] = last >> (distance * 8) |
> +				     next << ((bytes_long - distance) * 8);
> +
> +			d.ulong++;
> +			s.ulong++;
> +		}
> +
> +		/* restore s with the original offset */
> +		s.u8 += distance;
> +	} else
> +#endif
> +	{
> +		/* if the source and dest lower bits are the same, do a simple
> +		 * 32/64 bit wide copy.
> +		 */

A while() loop would make more sense here.

> +		for (; count >= bytes_long; count -= bytes_long)
> +			*d.ulong++ = *s.ulong++;
> +	}
> +
> +	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
> +	goto copy_remainder;
> +
> +copy_remainder:
> +	while (count--)
> +		*d.u8++ = *s.u8++;
> +
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memcpy);
> +
> +void *memcpy(void *dest, const void *src, size_t count) __weak
> __alias(__memcpy);
> +EXPORT_SYMBOL(memcpy);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
@ 2021-06-22  0:14     ` Nick Kossifidis
  0 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  0:14 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Hello Matteo and thanks for the patch,

On 2021-06-17 18:27, Matteo Croce wrote:
> 
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * String functions optimized for hardware which doesn't
> + * handle unaligned memory accesses efficiently.
> + *
> + * Copyright (C) 2021 Matteo Croce
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +
> +/* Minimum size for a word copy to be convenient */
> +#define MIN_THRESHOLD (BITS_PER_LONG / 8 * 2)
> +
> +/* convenience union to avoid cast between different pointer types */
> +union types {
> +	u8 *u8;

You are using a type as a name; I'd go with as_bytes/as_ulong/as_uptr,
which makes it easier for the reader to understand what you are trying
to do.

> +	unsigned long *ulong;
> +	uintptr_t uptr;
> +};
> +
> +union const_types {
> +	const u8 *u8;
> +	unsigned long *ulong;
> +};
> +

I suggest you define those unions inside the function body; no one else
is using them.

> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	const int bytes_long = BITS_PER_LONG / 8;
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	const int mask = bytes_long - 1;
> +	const int distance = (src - dest) & mask;

Why not unsigned ints?

> +#endif
> +	union const_types s = { .u8 = src };
> +	union types d = { .u8 = dest };
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

If you want to be compliant with memcpy you should check for overlapping 
regions here since "The memory areas must not overlap", and do nothing 
about it because according to POSIX this leads to undefined behavior. 
That's why recent libc implementations use memmove in any case (memcpy 
is an alias to memmove), which is the suggested approach.

> +	if (count < MIN_THRESHOLD)
> +		goto copy_remainder;
> +
> +	/* copy a byte at time until destination is aligned */
> +	for (; count && d.uptr & mask; count--)
> +		*d.u8++ = *s.u8++;
> +

You should check for !IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) here.

> +	if (distance) {
> +		unsigned long last, next;
> +
> +		/* move s backward to the previous alignment boundary */
> +		s.u8 -= distance;

It'd help here to explain that, since s is distance bytes ahead relative
to d, and d reached the alignment boundary above, s is now aligned, but
the data needs to be shifted to compensate for distance in order to do
the word-by-word copy.

> +
> +		/* 32/64 bit wide copy from s to d.
> +		 * d is aligned now but s is not, so read s alignment wise,
> +		 * and do proper shift to get the right value.
> +		 * Works only on Little Endian machines.
> +		 */

This comment is misleading, because s is aligned here; otherwise
s.ulong[0]/[1] below would result in an unaligned access.

> +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= 
> bytes_long) {
> +			last = next;
> +			next = s.ulong[1];
> +
> +			d.ulong[0] = last >> (distance * 8) |
> +				     next << ((bytes_long - distance) * 8);
> +
> +			d.ulong++;
> +			s.ulong++;
> +		}
> +
> +		/* restore s with the original offset */
> +		s.u8 += distance;
> +	} else
> +#endif
> +	{
> +		/* if the source and dest lower bits are the same, do a simple
> +		 * 32/64 bit wide copy.
> +		 */

A while() loop would make more sense here.

> +		for (; count >= bytes_long; count -= bytes_long)
> +			*d.ulong++ = *s.ulong++;
> +	}
> +
> +	/* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
> +	goto copy_remainder;
> +
> +copy_remainder:
> +	while (count--)
> +		*d.u8++ = *s.u8++;
> +
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memcpy);
> +
> +void *memcpy(void *dest, const void *src, size_t count) __weak
> __alias(__memcpy);
> +EXPORT_SYMBOL(memcpy);

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/3] riscv: optimized memmove
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-22  0:46     ` Nick Kossifidis
  -1 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  0:46 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On 2021-06-17 18:27, Matteo Croce wrote:
> +
> +/*
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.
> + */
> +void *__memmove(void *dest, const void *src, size_t count)
> +{
> +	if (dest < src || src + count <= dest)
> +		return memcpy(dest, src, count);
> +
> +	if (dest > src) {
> +		const char *s = src + count;
> +		char *tmp = dest + count;
> +
> +		while (count--)
> +			*--tmp = *--s;
> +	}
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memmove);
> +

Copying backwards byte-by-byte is suboptimal. I understand this is not
a very common scenario, but you could at least check if both buffers are
word-aligned, e.g. (((src + len) | (dst + len)) & mask), or misaligned
by the same offset, e.g. (((src + len) ^ (dst + len)) & mask), and still
end up doing word-by-word copying. Ideally it would be great if you
re-used the same technique you used for forward copying in your memcpy.
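
A rough sketch of such a backward word-by-word copy, assuming both
buffers end misaligned by the same offset (hypothetical, not part of
the patch):

	const size_t mask = sizeof(unsigned long) - 1;
	const char *s = src + count;
	char *d = dest + count;

	if ((((uintptr_t)s ^ (uintptr_t)d) & mask) == 0) {
		/* copy bytes until d (and therefore s) is word-aligned */
		while (count && ((uintptr_t)d & mask)) {
			*--d = *--s;
			count--;
		}
		/* then copy whole words backwards */
		while (count >= sizeof(unsigned long)) {
			d -= sizeof(unsigned long);
			s -= sizeof(unsigned long);
			*(unsigned long *)d = *(const unsigned long *)s;
			count -= sizeof(unsigned long);
		}
	}
	/* leftover bytes, or the mismatched-offset case */
	while (count--)
		*--d = *--s;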

> +void *memmove(void *dest, const void *src, size_t count) __weak
> __alias(__memmove);
> +EXPORT_SYMBOL(memmove);

As I mentioned on your memcpy patch, if you implement memmove, you can 
just alias memcpy to memmove and we won't have to worry about memcpy 
being used on overlapping regions.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/3] riscv: optimized memmove
@ 2021-06-22  0:46     ` Nick Kossifidis
  0 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  0:46 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On 2021-06-17 18:27, Matteo Croce wrote:
> +
> +/*
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.
> + */
> +void *__memmove(void *dest, const void *src, size_t count)
> +{
> +	if (dest < src || src + count <= dest)
> +		return memcpy(dest, src, count);
> +
> +	if (dest > src) {
> +		const char *s = src + count;
> +		char *tmp = dest + count;
> +
> +		while (count--)
> +			*--tmp = *--s;
> +	}
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memmove);
> +

Copying backwards byte-by-byte is suboptimal. I understand this is not
a very common scenario, but you could at least check if both buffers are
word-aligned, e.g. (((src + len) | (dst + len)) & mask), or misaligned
by the same offset, e.g. (((src + len) ^ (dst + len)) & mask), and still
end up doing word-by-word copying. Ideally it would be great if you
re-used the same technique you used for forward copying in your memcpy.

> +void *memmove(void *dest, const void *src, size_t count) __weak
> __alias(__memmove);
> +EXPORT_SYMBOL(memmove);

As I mentioned on your memcpy patch, if you implement memmove, you can 
just alias memcpy to memmove and we won't have to worry about memcpy 
being used on overlapping regions.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
  2021-06-17 15:27   ` Matteo Croce
@ 2021-06-22  1:07     ` Nick Kossifidis
  -1 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  1:07 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On 2021-06-17 18:27, Matteo Croce wrote:
> +
> +void *__memset(void *s, int c, size_t count)
> +{
> +	union types dest = { .u8 = s };
> +
> +	if (count >= MIN_THRESHOLD) {
> +		const int bytes_long = BITS_PER_LONG / 8;

You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const 
int mask = bytes_long - 1;' from your memcpy patch visible to memset as 
well (static const...) and use them here (mask would make more sense to 
be named as word_mask).

> +		unsigned long cu = (unsigned long)c;
> +
> +		/* Compose an ulong with 'c' repeated 4/8 times */
> +		cu |= cu << 8;
> +		cu |= cu << 16;
> +#if BITS_PER_LONG == 64
> +		cu |= cu << 32;
> +#endif
> +

You don't have to create cu here: you'll fill the dest buffer with 'c'
anyway, so after writing enough 'c' bytes to cover one aligned word of
dest, you can just load that word back and keep filling up dest with it.
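
Roughly what that would look like (a hypothetical sketch; dest.u8 is
the patch's union member, assumed to be word-aligned already, with
count >= sizeof(unsigned long)):

	unsigned char *p = dest.u8;
	unsigned long pattern;
	size_t i;

	/* write one word's worth of 'c' bytes ... */
	for (i = 0; i < sizeof(unsigned long); i++)
		p[i] = c;
	/* ... then load that word back and use it as the fill pattern */
	pattern = *(unsigned long *)p;
	for (i = sizeof(unsigned long); i + sizeof(unsigned long) <= count;
	     i += sizeof(unsigned long))
		*(unsigned long *)(p + i) = pattern;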

> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +		/* Fill the buffer one byte at time until the destination
> +		 * is aligned on a 32/64 bit boundary.
> +		 */
> +		for (; count && dest.uptr % bytes_long; count--)

You could reuse & mask here instead of % bytes_long.

> +			*dest.u8++ = c;
> +#endif

I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS in your
memcpy patch; is it worth it here? To begin with, riscv doesn't set it,
and even if it did, we are talking about a loop that runs just a few
times to reach the alignment boundary (worst case it'll run 7 times).
I don't think we gain much here, even for archs that have efficient
unaligned access.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
@ 2021-06-22  1:07     ` Nick Kossifidis
  0 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  1:07 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

On 2021-06-17 18:27, Matteo Croce wrote:
> +
> +void *__memset(void *s, int c, size_t count)
> +{
> +	union types dest = { .u8 = s };
> +
> +	if (count >= MIN_THRESHOLD) {
> +		const int bytes_long = BITS_PER_LONG / 8;

You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const 
int mask = bytes_long - 1;' from your memcpy patch visible to memset as 
well (static const...) and use them here (mask would make more sense to 
be named as word_mask).

> +		unsigned long cu = (unsigned long)c;
> +
> +		/* Compose an ulong with 'c' repeated 4/8 times */
> +		cu |= cu << 8;
> +		cu |= cu << 16;
> +#if BITS_PER_LONG == 64
> +		cu |= cu << 32;
> +#endif
> +

You don't have to create cu here: you'll fill the dest buffer with 'c'
anyway, so after writing enough 'c' bytes to cover one aligned word of
dest, you can just load that word back and keep filling up dest with it.

> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +		/* Fill the buffer one byte at time until the destination
> +		 * is aligned on a 32/64 bit boundary.
> +		 */
> +		for (; count && dest.uptr % bytes_long; count--)

You could reuse & mask here instead of % bytes_long.

> +			*dest.u8++ = c;
> +#endif

I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS in your
memcpy patch; is it worth it here? To begin with, riscv doesn't set it,
and even if it did, we are talking about a loop that runs just a few
times to reach the alignment boundary (worst case it'll run 7 times).
I don't think we gain much here, even for archs that have efficient
unaligned access.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/3] riscv: optimized mem* functions
  2021-06-17 15:27 ` Matteo Croce
@ 2021-06-22  1:09   ` Nick Kossifidis
  -1 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  1:09 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Hello Matteo,

On 2021-06-17 18:27, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Replace the assembly mem{cpy,move,set} with C equivalent.
> 
> Try to access RAM with the largest bit width possible, but without
> doing unaligned accesses.
> 
> Tested on a BeagleV Starlight with a SiFive U74 core, where the
> improvement is noticeable.
> 

There are already generic C implementations of memcpy/memmove/memset at
https://elixir.bootlin.com/linux/v5.13-rc7/source/lib/string.c#L871, but
they copy one byte at a time. I suggest you update them to do
word-by-word copies instead of introducing yet another memcpy/memmove C
implementation under arch/riscv/.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/3] riscv: optimized mem* functions
@ 2021-06-22  1:09   ` Nick Kossifidis
  0 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-22  1:09 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren

Hello Matteo,

On 2021-06-17 18:27, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Replace the assembly mem{cpy,move,set} with C equivalent.
> 
> Try to access RAM with the largest bit width possible, but without
> doing unaligned accesses.
> 
> Tested on a BeagleV Starlight with a SiFive U74 core, where the
> improvement is noticeable.
> 

There are already generic C implementations of memcpy/memmove/memset at
https://elixir.bootlin.com/linux/v5.13-rc7/source/lib/string.c#L871, but
they copy one byte at a time. I suggest you update them to do
word-by-word copies instead of introducing yet another memcpy/memmove C
implementation under arch/riscv/.




_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/3] riscv: optimized mem* functions
  2021-06-22  1:09   ` Nick Kossifidis
@ 2021-06-22  2:39     ` Guo Ren
  -1 siblings, 0 replies; 48+ messages in thread
From: Guo Ren @ 2021-06-22  2:39 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Matteo Croce, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight

On Tue, Jun 22, 2021 at 9:09 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> Hello Matteo,
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Replace the assembly mem{cpy,move,set} with C equivalent.
> >
> > Try to access RAM with the largest bit width possible, but without
> > doing unaligned accesses.
> >
> > Tested on a BeagleV Starlight with a SiFive U74 core, where the
> > improvement is noticeable.
> >
>
> There are already generic C implementations for memcpy/memmove/memset at
> https://elixir.bootlin.com/linux/v5.13-rc7/source/lib/string.c#L871 but
> are doing one byte at a time, I suggest you update them to do
> word-by-word copy instead of introducing yet another memcpy/memmove C
> implementation on arch/riscv/.
Yes, I've tried to copy the Glibc version into arch/csky/abiv1, and
Arnd suggested putting them into generic.
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

>
>
>



--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/3] riscv: optimized mem* functions
@ 2021-06-22  2:39     ` Guo Ren
  0 siblings, 0 replies; 48+ messages in thread
From: Guo Ren @ 2021-06-22  2:39 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Matteo Croce, linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight

On Tue, Jun 22, 2021 at 9:09 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> Hello Matteo,
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Replace the assembly mem{cpy,move,set} with C equivalent.
> >
> > Try to access RAM with the largest bit width possible, but without
> > doing unaligned accesses.
> >
> > Tested on a BeagleV Starlight with a SiFive U74 core, where the
> > improvement is noticeable.
> >
>
> There are already generic C implementations for memcpy/memmove/memset at
> https://elixir.bootlin.com/linux/v5.13-rc7/source/lib/string.c#L871 but
> are doing one byte at a time, I suggest you update them to do
> word-by-word copy instead of introducing yet another memcpy/memmove C
> implementation on arch/riscv/.
Yes, I've tried to copy the Glibc version into arch/csky/abiv1, and
Arnd suggested putting them into generic.
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

>
>
>



--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-21 14:26     ` Christoph Hellwig
@ 2021-06-22  8:19       ` David Laight
  -1 siblings, 0 replies; 48+ messages in thread
From: David Laight @ 2021-06-22  8:19 UTC (permalink / raw)
  To: 'Christoph Hellwig', Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, Guo Ren

From: Christoph Hellwig
> Sent: 21 June 2021 15:27
...
> > +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> 
> Please avoid the pointlessly overlong line.  And (just as a matter of
> personal preference) I find for loop that don't actually use a single
> iterator rather confusing.  Wjy not simply:
> 
> 		next = s.ulong[0];
> 		while (count >= bytes_long + mask) {
> 			...
> 			count -= bytes_long;
> 		}

My first attack on long 'for' statements is just to move the
initialisation to the previous line.
Then make sure there is nothing in the comparison that needs
to be calculated every iteration.
I suspect you can subtract 'mask' from 'count'.
Giving:
		count -= mask;
		next = s.ulong[0];
		for (; count > bytes_long; count -= bytes_long) {

Next is to shorten the variable names!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v3 1/3] riscv: optimized memcpy
@ 2021-06-22  8:19       ` David Laight
  0 siblings, 0 replies; 48+ messages in thread
From: David Laight @ 2021-06-22  8:19 UTC (permalink / raw)
  To: 'Christoph Hellwig', Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, Guo Ren

From: Christoph Hellwig
> Sent: 21 June 2021 15:27
...
> > +		for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> 
> Please avoid the pointlessly overlong line.  And (just as a matter of
> personal preference) I find for loop that don't actually use a single
> iterator rather confusing.  Wjy not simply:
> 
> 		next = s.ulong[0];
> 		while (count >= bytes_long + mask) {
> 			...
> 			count -= bytes_long;
> 		}

My first attack on long 'for' statements is just to move the
initialisation to the previous line.
Then make sure there is nothing in the comparison that needs
to be calculated every iteration.
I suspect you can subtract 'mask' from 'count'.
Giving:
		count -= mask;
		next = s.ulong[0];
		for (; count > bytes_long; count -= bytes_long) {

Next is to shorten the variable names!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v3 3/3] riscv: optimized memset
  2021-06-22  1:07     ` Nick Kossifidis
@ 2021-06-22  8:38       ` David Laight
  -1 siblings, 0 replies; 48+ messages in thread
From: David Laight @ 2021-06-22  8:38 UTC (permalink / raw)
  To: 'Nick Kossifidis', Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, Guo Ren

From: Nick Kossifidis
> Sent: 22 June 2021 02:08
> 
On 2021-06-17 18:27, Matteo Croce wrote:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +	union types dest = { .u8 = s };
> > +
> > +	if (count >= MIN_THRESHOLD) {
> > +		const int bytes_long = BITS_PER_LONG / 8;
> 
> You could make 'const int bytes_long = BITS_PER_LONG / 8;'

What is wrong with sizeof (long) ?
...
> > +		unsigned long cu = (unsigned long)c;
> > +
> > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > +		cu |= cu << 8;
> > +		cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > +		cu |= cu << 32;
> > +#endif
> > +
> 
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.

That will be a lot slower - especially if run on something like x86.
A write followed by a read of the same size is optimised by the
store-to-load forwarder, but byte writes followed by a word read will
have to go via the cache.

You can just write:
	cu = (unsigned long)c * 0x0101010101010101ull;
and let the compiler sort out the best way to generate the constant.
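
For example, on 64-bit, c = 0x41 gives
0x41 * 0x0101010101010101 = 0x4141414141414141, i.e. the byte repeated
in every position; on 32-bit the multiplier would be 0x01010101.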

> 
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +		/* Fill the buffer one byte at time until the destination
> > +		 * is aligned on a 32/64 bit boundary.
> > +		 */
> > +		for (; count && dest.uptr % bytes_long; count--)
> 
> You could reuse & mask here instead of % bytes_long.
> 
> > +			*dest.u8++ = c;
> > +#endif
> 
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.

With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
even checking the alignment.
While aligning the copy will be quicker for an unaligned buffer,
unaligned buffers almost certainly don't happen often enough to worry
about.
In any case you'd want to do a misaligned word write to the start
of the buffer - not separate byte writes.
Provided the buffer is long enough you can also do a misaligned write
to the end of the buffer before filling from the start.
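
A rough sketch of that head/tail trick for memset, assuming unaligned
stores are cheap and count >= 2 * sizeof(long) (hypothetical, with
64-bit constants):

	#include <linux/kernel.h>
	#include <asm/unaligned.h>

	static void *memset_sketch(void *s, int c, size_t count)
	{
		unsigned long pat = (unsigned char)c * 0x0101010101010101UL;
		uintptr_t p = (uintptr_t)s;
		uintptr_t end = p + count;
		uintptr_t a;

		/* one misaligned store each for the head and the tail */
		put_unaligned(pat, (unsigned long *)p);
		put_unaligned(pat, (unsigned long *)(end - sizeof(long)));

		/* aligned stores for everything in between */
		for (a = ALIGN(p + 1, sizeof(long)); a + sizeof(long) <= end;
		     a += sizeof(long))
			*(unsigned long *)a = pat;

		return s;
	}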

I suspect you may need either barrier() or a pointer to a packed struct
to avoid the perverted 'undefined behaviour' fubar.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v3 3/3] riscv: optimized memset
@ 2021-06-22  8:38       ` David Laight
  0 siblings, 0 replies; 48+ messages in thread
From: David Laight @ 2021-06-22  8:38 UTC (permalink / raw)
  To: 'Nick Kossifidis', Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, Guo Ren

From: Nick Kossifidis
> Sent: 22 June 2021 02:08
> 
On 2021-06-17 18:27, Matteo Croce wrote:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +	union types dest = { .u8 = s };
> > +
> > +	if (count >= MIN_THRESHOLD) {
> > +		const int bytes_long = BITS_PER_LONG / 8;
> 
> You could make 'const int bytes_long = BITS_PER_LONG / 8;'

What is wrong with sizeof (long) ?
...
> > +		unsigned long cu = (unsigned long)c;
> > +
> > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > +		cu |= cu << 8;
> > +		cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > +		cu |= cu << 32;
> > +#endif
> > +
> 
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.

That will be a lot slower - especially if run on something like x86.
A write followed by a read of the same size is optimised by the
store-to-load forwarder, but byte writes followed by a word read will
have to go via the cache.

You can just write:
	cu = (unsigned long)c * 0x0101010101010101ull;
and let the compiler sort out the best way to generate the constant.

> 
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +		/* Fill the buffer one byte at time until the destination
> > +		 * is aligned on a 32/64 bit boundary.
> > +		 */
> > +		for (; count && dest.uptr % bytes_long; count--)
> 
> You could reuse & mask here instead of % bytes_long.
> 
> > +			*dest.u8++ = c;
> > +#endif
> 
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.

With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
even checking the alignment.
While aligning the copy will be quicker for an unaligned buffer,
unaligned buffers almost certainly don't happen often enough to worry
about.
In any case you'd want to do a misaligned word write to the start
of the buffer - not separate byte writes.
Provided the buffer is long enough you can also do a misaligned write
to the end of the buffer before filling from the start.

I suspect you may need either barrier() or a pointer to a packed struct
to avoid the perverted 'undefined behaviour' fubar.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-21 14:26     ` Christoph Hellwig
@ 2021-06-22 22:00       ` Matteo Croce
  -1 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-22 22:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Guo Ren

On Mon, Jun 21, 2021 at 4:26 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jun 17, 2021 at 05:27:52PM +0200, Matteo Croce wrote:
> > +extern void *memcpy(void *dest, const void *src, size_t count);
> > +extern void *__memcpy(void *dest, const void *src, size_t count);
>
> No need for externs.
>

Right.

> > +++ b/arch/riscv/lib/string.c
>
> Nothing in her looks RISC-V specific.  Why doesn't this go into lib/ so
> that other architectures can use it as well.
>

Technically it could go into lib/ and be generic.
If you think it's worth it, I just have to handle the different
left/right shifts because of endianness.
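
For what it's worth, a sketch of how the shift could be made
endian-aware (only the little-endian branch is what the patch does
today; the big-endian branch is hypothetical):

	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
		d.ulong[0] = last << (distance * 8) |
			     next >> ((bytes_long - distance) * 8);
	else
		d.ulong[0] = last >> (distance * 8) |
			     next << ((bytes_long - distance) * 8);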

> > +#include <linux/module.h>
>
> I think you only need export.h.
>

Nice.

> > +void *__memcpy(void *dest, const void *src, size_t count)
> > +{
> > +     const int bytes_long = BITS_PER_LONG / 8;
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +     const int mask = bytes_long - 1;
> > +     const int distance = (src - dest) & mask;
> > +#endif
> > +     union const_types s = { .u8 = src };
> > +     union types d = { .u8 = dest };
> > +
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +     if (count < MIN_THRESHOLD)
>
> Using IS_ENABLED we can avoid a lot of the mess in this
> function.
>
>         int distance = 0;
>
>         if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
>                 if (count < MIN_THRESHOLD)
>                         goto copy_remainder;
>
>                 /* copy a byte at time until destination is aligned */
>                 for (; count && d.uptr & mask; count--)
>                         *d.u8++ = *s.u8++;
>                 distance = (src - dest) & mask;
>         }
>

Cool. What about putting this check at the very start:

        if (count < MIN_THRESHOLD)
                goto copy_remainder;

And since count is at least twice bytes_long, remove count from the check below?
Also, setting distance after d is aligned is as simple as getting the
lower bits of s:

        if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
                /* Copy a byte at time until destination is aligned. */
                for (; d.uptr & mask; count--)
                        *d.u8++ = *s.u8++;

                distance = s.uptr & mask;
        }

>         if (distance) {
>                 ...
>
> > +             /* 32/64 bit wide copy from s to d.
> > +              * d is aligned now but s is not, so read s alignment wise,
> > +              * and do proper shift to get the right value.
> > +              * Works only on Little Endian machines.
> > +              */
>
> Normal kernel comment style always starts with a:
>

Right, I was used to netdev ones :)

>                 /*
>
>
> > +             for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
>
> Please avoid the pointlessly overlong line.  And (just as a matter of
> personal preference) I find for loop that don't actually use a single
> iterator rather confusing.  Wjy not simply:
>
>                 next = s.ulong[0];
>                 while (count >= bytes_long + mask) {
>                         ...
>                         count -= bytes_long;
>                 }

My fault, in a previous version it was:

    next = s.ulong[0];
    for (; count >= bytes_long + mask; count -= bytes_long) {

So as to have a single `count` counter for the loop.

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
@ 2021-06-22 22:00       ` Matteo Croce
  0 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-22 22:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Guo Ren

On Mon, Jun 21, 2021 at 4:26 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jun 17, 2021 at 05:27:52PM +0200, Matteo Croce wrote:
> > +extern void *memcpy(void *dest, const void *src, size_t count);
> > +extern void *__memcpy(void *dest, const void *src, size_t count);
>
> No need for externs.
>

Right.

> > +++ b/arch/riscv/lib/string.c
>
> Nothing in her looks RISC-V specific.  Why doesn't this go into lib/ so
> that other architectures can use it as well.
>

Technically it could go into lib/ and be generic.
If you think it's worth it, I just have to handle the different
left/right shifts because of endianness.

> > +#include <linux/module.h>
>
> I think you only need export.h.
>

Nice.

> > +void *__memcpy(void *dest, const void *src, size_t count)
> > +{
> > +     const int bytes_long = BITS_PER_LONG / 8;
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +     const int mask = bytes_long - 1;
> > +     const int distance = (src - dest) & mask;
> > +#endif
> > +     union const_types s = { .u8 = src };
> > +     union types d = { .u8 = dest };
> > +
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +     if (count < MIN_THRESHOLD)
>
> Using IS_ENABLED we can avoid a lot of the mess in this
> function.
>
>         int distance = 0;
>
>         if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
>                 if (count < MIN_THRESHOLD)
>                         goto copy_remainder;
>
>                 /* copy a byte at time until destination is aligned */
>                 for (; count && d.uptr & mask; count--)
>                         *d.u8++ = *s.u8++;
>                 distance = (src - dest) & mask;
>         }
>

Cool. What about putting this check at the very start:

        if (count < MIN_THRESHOLD)
                goto copy_remainder;

And since count is at least twice bytes_long, remove count from the check below?
Also, setting distance after d is aligned is as simple as getting the
lower bits of s:

        if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
                /* Copy a byte at time until destination is aligned. */
                for (; d.uptr & mask; count--)
                        *d.u8++ = *s.u8++;

                distance = s.uptr & mask;
        }

>         if (distance) {
>                 ...
>
> > +             /* 32/64 bit wide copy from s to d.
> > +              * d is aligned now but s is not, so read s alignment wise,
> > +              * and do proper shift to get the right value.
> > +              * Works only on Little Endian machines.
> > +              */
>
> Normal kernel comment style always starts with a:
>

Right, I was used to netdev ones :)

>                 /*
>
>
> > +             for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
>
> Please avoid the pointlessly overlong line.  And (just as a matter of
> personal preference) I find for loop that don't actually use a single
> iterator rather confusing.  Wjy not simply:
>
>                 next = s.ulong[0];
>                 while (count >= bytes_long + mask) {
>                         ...
>                         count -= bytes_long;
>                 }

My fault, in a previous version it was:

    next = s.ulong[0];
    for (; count >= bytes_long + mask; count -= bytes_long) {

So as to have a single `count` counter for the loop.

Regards,
-- 
per aspera ad upstream

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-22  8:19       ` David Laight
@ 2021-06-22 22:53         ` Matteo Croce
  -1 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-22 22:53 UTC (permalink / raw)
  To: David Laight
  Cc: Christoph Hellwig, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	Guo Ren

On Tue, Jun 22, 2021 at 10:19 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Christoph Hellwig
> > Sent: 21 June 2021 15:27
> ...
> > > +           for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> >
> > Please avoid the pointlessly overlong line.  And (just as a matter of
> > personal preference) I find for loop that don't actually use a single
> > iterator rather confusing.  Wjy not simply:
> >
> >               next = s.ulong[0];
> >               while (count >= bytes_long + mask) {
> >                       ...
> >                       count -= bytes_long;
> >               }
>
> My fist attack on long 'for' statements is just to move the
> initialisation to the previous line.
> Then make sure there is nothing in the comparison that needs
> to be calculated every iteration.
> I suspect you can subtract 'mask' from 'count'.
> Giving:
>                 count -= mask;
>                 next = s.ulong[0];
>                 for (;; count > bytes_long; count -= bytes_long) {
>

This way we'll lose the remainder, as count is used at the end to copy
the leftover.
Anyway, both bytes_long and mask are constants; I doubt they get summed
on every iteration.

> Next is to shorten the variable names!
>
>         David
>

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
@ 2021-06-22 22:53         ` Matteo Croce
  0 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-22 22:53 UTC (permalink / raw)
  To: David Laight
  Cc: Christoph Hellwig, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	Guo Ren

On Tue, Jun 22, 2021 at 10:19 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Christoph Hellwig
> > Sent: 21 June 2021 15:27
> ...
> > > +           for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> >
> > Please avoid the pointlessly overlong line.  And (just as a matter of
> > personal preference) I find for loop that don't actually use a single
> > iterator rather confusing.  Wjy not simply:
> >
> >               next = s.ulong[0];
> >               while (count >= bytes_long + mask) {
> >                       ...
> >                       count -= bytes_long;
> >               }
>
> My fist attack on long 'for' statements is just to move the
> initialisation to the previous line.
> Then make sure there is nothing in the comparison that needs
> to be calculated every iteration.
> I suspect you can subtract 'mask' from 'count'.
> Giving:
>                 count -= mask;
>                 next = s.ulong[0];
>                 for (;; count > bytes_long; count -= bytes_long) {
>

This way we'll lose the remainder, as count is used at the end to copy
the leftover.
Anyway, both bytes_long and mask are constants; I doubt they get summed
on every iteration.

> Next is to shorten the variable names!
>
>         David
>

-- 
per aspera ad upstream

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-22  0:14     ` Nick Kossifidis
@ 2021-06-22 23:35       ` Matteo Croce
  -1 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-22 23:35 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Guo Ren

On Tue, Jun 22, 2021 at 2:15 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> Hello Matteo and thanks for the patch,
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> >
> > @@ -0,0 +1,91 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * String functions optimized for hardware which doesn't
> > + * handle unaligned memory accesses efficiently.
> > + *
> > + * Copyright (C) 2021 Matteo Croce
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/module.h>
> > +
> > +/* Minimum size for a word copy to be convenient */
> > +#define MIN_THRESHOLD (BITS_PER_LONG / 8 * 2)
> > +
> > +/* convenience union to avoid cast between different pointer types */
> > +union types {
> > +     u8 *u8;
>
> You are using a type as a name, I'd go with as_bytes/as_ulong/as_uptr
> which makes it easier for the reader to understand what you are trying
> to do.
>

Makes sense
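
Something like this is probably what Nick has in mind (field names are
illustrative, and the const on the read-only word pointer is an extra tweak
not in the original):

union types {
	u8 *as_bytes;
	unsigned long *as_ulong;
	uintptr_t as_uptr;
};

union const_types {
	const u8 *as_bytes;
	const unsigned long *as_ulong;
};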

> > +     unsigned long *ulong;
> > +     uintptr_t uptr;
> > +};
> > +
> > +union const_types {
> > +     const u8 *u8;
> > +     unsigned long *ulong;
> > +};
> > +
>
> I suggest you define those unions inside the function body, no one else
> is using them.
>

They will be used in memset(), in patch 3/3

> > +void *__memcpy(void *dest, const void *src, size_t count)
> > +{
> > +     const int bytes_long = BITS_PER_LONG / 8;
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +     const int mask = bytes_long - 1;
> > +     const int distance = (src - dest) & mask;
>
> Why not unsigned ints ?
>

Ok.

> > +#endif
> > +     union const_types s = { .u8 = src };
> > +     union types d = { .u8 = dest };
> > +
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> If you want to be compliant with memcpy you should check for overlapping
> regions here since "The memory areas must not overlap", and do nothing
> about it because according to POSIX this leads to undefined behavior.
> That's why recent libc implementations use memmove in any case (memcpy
> is an alias to memmove), which is the suggested approach.
>

Mmm, which arch memcpy implementation does this check?
I guess that no one is currently doing it.

> > +     if (count < MIN_THRESHOLD)
> > +             goto copy_remainder;
> > +
> > +     /* copy a byte at time until destination is aligned */
> > +     for (; count && d.uptr & mask; count--)
> > +             *d.u8++ = *s.u8++;
> > +
>
> You should check for !IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) here.
>

I thought that only little-endian RISC-V machines were supported in Linux.
Should I add a BUILD_BUG_ON()?
Anyway, if this is going into generic lib/, I'll take care of the endianness.
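
A possible guard, assuming it sits at the top of __memcpy() (untested):

	/* the shift-based copy below assumes little-endian byte order */
	BUILD_BUG_ON(IS_ENABLED(CONFIG_CPU_BIG_ENDIAN));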

> > +     if (distance) {
> > +             unsigned long last, next;
> > +
> > +             /* move s backward to the previous alignment boundary */
> > +             s.u8 -= distance;
>
> It'd help here to explain that since s is distance bytes ahead relative
> to d, and d reached the alignment boundary above, s is now aligned but
> the data needs to be shifted to compensate for distance, in order to do
> word-by-word copy.
>
> > +
> > +             /* 32/64 bit wide copy from s to d.
> > +              * d is aligned now but s is not, so read s alignment wise,
> > +              * and do proper shift to get the right value.
> > +              * Works only on Little Endian machines.
> > +              */
>
> This comment is misleading because s is aligned, or else s.ulong[0]/[1]
> below would result in an unaligned access.
>

Yes, those two comments should be rephrased, merged and put above.
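
One possible wording for the merged comment, just as a sketch:

	/*
	 * s is 'distance' bytes ahead of d, and d is now word aligned, so
	 * step s back to its own word boundary and read it word-wise.
	 * Each destination word is then built from two consecutive source
	 * words, shifted to compensate for the offset.
	 * Works only on little-endian machines.
	 */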

> > +             for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
> > +                     last = next;
> > +                     next = s.ulong[1];
> > +
> > +                     d.ulong[0] = last >> (distance * 8) |
> > +                                  next << ((bytes_long - distance) * 8);
> > +
> > +                     d.ulong++;
> > +                     s.ulong++;
> > +             }
> > +
> > +             /* restore s with the original offset */
> > +             s.u8 += distance;
> > +     } else
> > +#endif
> > +     {
> > +             /* if the source and dest lower bits are the same, do a simple
> > +              * 32/64 bit wide copy.
> > +              */
>
> A while() loop would make more sense here.
>

Ok.

> > +             for (; count >= bytes_long; count -= bytes_long)
> > +                     *d.ulong++ = *s.ulong++;
> > +     }
> > +
> > +     /* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
> > +     goto copy_remainder;
> > +
> > +copy_remainder:
> > +     while (count--)
> > +             *d.u8++ = *s.u8++;
> > +
> > +     return dest;
> > +}
> > +EXPORT_SYMBOL(__memcpy);
> > +
> > +void *memcpy(void *dest, const void *src, size_t count) __weak
> > __alias(__memcpy);
> > +EXPORT_SYMBOL(memcpy);

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
  2021-06-22  1:07     ` Nick Kossifidis
@ 2021-06-23  0:08       ` Matteo Croce
  -1 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-23  0:08 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Guo Ren

On Tue, Jun 22, 2021 at 3:07 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +     union types dest = { .u8 = s };
> > +
> > +     if (count >= MIN_THRESHOLD) {
> > +             const int bytes_long = BITS_PER_LONG / 8;
>
> You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const
> int mask = bytes_long - 1;' from your memcpy patch visible to memset as
> well (static const...) and use them here (mask would make more sense to
> be named as word_mask).
>

I'll do that.
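
A sketch of how the constants could be shared, following Nick's suggestion
(names are placeholders):

/* visible to both __memcpy() and __memset() */
static const unsigned int bytes_long = BITS_PER_LONG / 8;
static const unsigned int word_mask = BITS_PER_LONG / 8 - 1;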

> > +             unsigned long cu = (unsigned long)c;
> > +
> > +             /* Compose an ulong with 'c' repeated 4/8 times */
> > +             cu |= cu << 8;
> > +             cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > +             cu |= cu << 32;
> > +#endif
> > +
>
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.
>

I tried that, but this way I'd have to write 8 more bytes before the
word-wise memset can start.
Besides, the machine code needed to generate 'cu' is just 6 instructions on riscv:

slli a5,a0,8
or a5,a5,a0
slli a0,a5,16
or a0,a0,a5
slli a5,a0,32
or a0,a5,a0

so probably it's not worth it.

> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +             /* Fill the buffer one byte at time until the destination
> > +              * is aligned on a 32/64 bit boundary.
> > +              */
> > +             for (; count && dest.uptr % bytes_long; count--)
>
> You could reuse & mask here instead of % bytes_long.
>

Sure, even though the generated machine code will be the same.
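
i.e. something like this, assuming the mask constant is made visible here
as discussed above:

		for (; count && (dest.uptr & mask); count--)
			*dest.u8++ = c;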

> > +                     *dest.u8++ = c;
> > +#endif
>
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.

It doesn't _now_, but maybe in the future we will have a CPU which
handles unaligned accesses efficiently!

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/3] riscv: optimized memset
  2021-06-22  8:38       ` David Laight
@ 2021-06-23  1:14         ` Matteo Croce
  -1 siblings, 0 replies; 48+ messages in thread
From: Matteo Croce @ 2021-06-23  1:14 UTC (permalink / raw)
  To: David Laight
  Cc: Nick Kossifidis, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	Guo Ren

On Tue, Jun 22, 2021 at 10:38 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Nick Kossifidis
> > Sent: 22 June 2021 02:08
> >
> > On 2021-06-17 18:27, Matteo Croce wrote:
> > > +
> > > +void *__memset(void *s, int c, size_t count)
> > > +{
> > > +   union types dest = { .u8 = s };
> > > +
> > > +   if (count >= MIN_THRESHOLD) {
> > > +           const int bytes_long = BITS_PER_LONG / 8;
> >
> > You could make 'const int bytes_long = BITS_PER_LONG / 8;'
>
> What is wrong with sizeof (long) ?
> ...

Nothing, I guess that BITS_PER_LONG is just (sizeof(long) * 8) anyway

> > > +           unsigned long cu = (unsigned long)c;
> > > +
> > > +           /* Compose an ulong with 'c' repeated 4/8 times */
> > > +           cu |= cu << 8;
> > > +           cu |= cu << 16;
> > > +#if BITS_PER_LONG == 64
> > > +           cu |= cu << 32;
> > > +#endif
> > > +
> >
> > You don't have to create cu here, you'll fill dest buffer with 'c'
> > anyway so after filling up enough 'c's to be able to grab an aligned
> > word full of them from dest, you can just grab that word and keep
> > filling up dest with it.
>
> That will be a lot slower - especially if run on something like x86.
> A write-read of the same size is optimised by the store-load forwarder.
> But the byte write, word read will have to go via the cache.
>
> You can just write:
>         cu = (unsigned long)c * 0x0101010101010101ull;
> and let the compiler sort out the best way to generate the constant.
>

Interesting. I see that most compilers do an integer multiplication;
is it faster than three shifts and three ORs?

clang on riscv generates even more instructions to create the immediate:

unsigned long repeat_shift(int c)
{
  unsigned long cu = (unsigned long)c;
  cu |= cu << 8;
  cu |= cu << 16;
  cu |= cu << 32;

  return cu;
}

unsigned long repeat_mul(int c)
{
  return (unsigned long)c * 0x0101010101010101ull;
}

repeat_shift:
  slli a1, a0, 8
  or a0, a0, a1
  slli a1, a0, 16
  or a0, a0, a1
  slli a1, a0, 32
  or a0, a0, a1
  ret

repeat_mul:
  lui a1, 4112
  addiw a1, a1, 257
  slli a1, a1, 16
  addi a1, a1, 257
  slli a1, a1, 16
  addi a1, a1, 257
  mul a0, a0, a1
  ret

> >
> > > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > > +           /* Fill the buffer one byte at time until the destination
> > > +            * is aligned on a 32/64 bit boundary.
> > > +            */
> > > +           for (; count && dest.uptr % bytes_long; count--)
> >
> > You could reuse & mask here instead of % bytes_long.
> >
> > > +                   *dest.u8++ = c;
> > > +#endif
> >
> > I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> > memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> > and even if it did we are talking about a loop that will run just a few
> > times to reach the alignment boundary (worst case scenario it'll run 7
> > times), I don't think we gain much here, even for archs that have
> > efficient unaligned access.
>
> With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
> even checking the alignment.
> While aligning the copy will be quicker for an unaligned buffer they
> almost certainly don't happen often enough to worry about.
> In any case you'd want to do a misaligned word write to the start
> of the buffer - not separate byte writes.
> Provided the buffer is long enough you can also do a misaligned write
> to the end of the buffer before filling from the start.
>

I don't understand this one: a misaligned write here is ~30x slower
than an aligned one, because it gets trapped and emulated in SBI.
How can this be a win?

> I suspect you may need either barrier() or use a ptr to packed
> to avoid the perverted 'undefined behaviour' fubar.'
>

Which UB are you referring to?

Regards,
--
per aspera ad upstream

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v3 3/3] riscv: optimized memset
  2021-06-23  1:14         ` Matteo Croce
@ 2021-06-23  9:05           ` David Laight
  -1 siblings, 0 replies; 48+ messages in thread
From: David Laight @ 2021-06-23  9:05 UTC (permalink / raw)
  To: 'Matteo Croce'
  Cc: Nick Kossifidis, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	Guo Ren

From: Matteo Croce
> Sent: 23 June 2021 02:15
> 
> On Tue, Jun 22, 2021 at 10:38 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Nick Kossifidis
...
> > You can just write:
> >         cu = (unsigned long)c * 0x0101010101010101ull;
> > and let the compiler sort out the best way to generate the constant.
> >
> 
> Interesting. I see that most compilers do an integer multiplication,
> is it faster than three shift and three or?
> 
> clang on riscv generates even more instructions to create the immediate:
> 
> unsigned long repeat_shift(int c)
> {
>   unsigned long cu = (unsigned long)c;
>   cu |= cu << 8;
>   cu |= cu << 16;
>   cu |= cu << 32;
> 
>   return cu;
> }
> 
> unsigned long repeat_mul(int c)
> {
>   return (unsigned long)c * 0x0101010101010101ull;
> }
> 
> repeat_shift:
>   slli a1, a0, 8
>   or a0, a0, a1
>   slli a1, a0, 16
>   or a0, a0, a1
>   slli a1, a0, 32
>   or a0, a0, a1
>   ret
> 
> repeat_mul:
>   lui a1, 4112
>   addiw a1, a1, 257
>   slli a1, a1, 16
>   addi a1, a1, 257
>   slli a1, a1, 16
>   addi a1, a1, 257
>   mul a0, a0, a1
>   ret

Hmmm... I expected the compiler to convert it to the first form.
It is also pretty crap at generating that constant.
Stupid compilers.

In any case, for the usual case of 'c' being a constant zero
you really don't want the latency of those instructions at all.

It is almost worth just pushing that expansion into the caller.

eg by having:
#define memset(p, v, l) memset_w(p, (v) * 0x0101010101010101, l)
(or some other byte replicator).

Really annoyingly you want to write the code that generates
the 64-bit constant, and then have the compiler optimise away
the part that generates the high 32 bits on 32-bit systems.
But one of the compilers is going to 'bleat' about truncating
a constant value.
Stupid compilers (again).
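
One way to spell that replication so it works for both widths; the explicit
cast is meant to silence the truncation warning, though whether every
compiler stays quiet about it is an assumption:

#define BYTE_REPEAT(v) \
	((unsigned long)(unsigned char)(v) * (unsigned long)0x0101010101010101ULL)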

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/3] riscv: optimized memcpy
  2021-06-22 23:35       ` Matteo Croce
@ 2021-06-23  9:48         ` Nick Kossifidis
  -1 siblings, 0 replies; 48+ messages in thread
From: Nick Kossifidis @ 2021-06-23  9:48 UTC (permalink / raw)
  To: Matteo Croce
  Cc: Nick Kossifidis, linux-riscv, Linux Kernel Mailing List,
	linux-arch, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Emil Renner Berthing, Akira Tsukamoto, Drew Fustini,
	Bin Meng, David Laight, Guo Ren

On 2021-06-23 02:35, Matteo Croce wrote:
>> 
>> If you want to be compliant with memcpy you should check for 
>> overlapping
>> regions here since "The memory areas must not overlap", and do nothing
>> about it because according to POSIX this leads to undefined behavior.
>> That's why recent libc implementations use memmove in any case (memcpy
>> is an alias to memmove), which is the suggested approach.
>> 
> 
> Mmm which memcpy arch implementation does this check?
> I guess that noone is currently doing it.
> 

Yup, because even if they did they wouldn't know what to do about it, since 
POSIX leaves this as undefined behavior. So instead of using memcpy it's 
suggested that people use memmove, which can handle overlapping regions, 
and because we can't patch the rest of the kernel to only use memmove 
(or the rest of the programs if we were a libc), the idea is to just 
alias memcpy to memmove (BTW Torvalds also thinks this is a good idea: 
https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132).
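
In this series that aliasing could reuse the existing __alias() pattern,
assuming __memmove from patch 2/3 is in scope; a sketch:

void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memmove);
EXPORT_SYMBOL(memcpy);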

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/3] riscv: optimized memmove
  2021-06-17 15:27   ` Matteo Croce
  (?)
@ 2021-06-30  4:40     ` kernel test robot
  -1 siblings, 0 replies; 48+ messages in thread
From: kernel test robot @ 2021-06-30  4:40 UTC (permalink / raw)
  To: Matteo Croce, linux-riscv
  Cc: kbuild-all, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini

[-- Attachment #1: Type: text/plain, Size: 2440 bytes --]

Hi Matteo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.13 next-20210629]
[cannot apply to atish-riscv-linux/topo_v3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matteo-Croce/riscv-optimized-mem-functions/20210617-232934
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git dd860052c99b1e088352bdd4fb7aef46f8d2ef47
config: riscv-allyesconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/211a65b2ee262f10eec0dfc17896aab1d2f0aea2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matteo-Croce/riscv-optimized-mem-functions/20210617-232934
        git checkout 211a65b2ee262f10eec0dfc17896aab1d2f0aea2
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/riscv/lib/string.c: In function '__memmove':
>> arch/riscv/lib/string.c:90:7: error: inlining failed in call to always_inline 'memcpy': function body can be overwritten at link time
      90 | void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
         |       ^~~~~~
   arch/riscv/lib/string.c:100:10: note: called from here
     100 |   return memcpy(dest, src, count);
         |          ^~~~~~~~~~~~~~~~~~~~~~~~


vim +/memcpy +90 arch/riscv/lib/string.c

c35a3474b67826 Matteo Croce 2021-06-17  89  
c35a3474b67826 Matteo Croce 2021-06-17 @90  void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
c35a3474b67826 Matteo Croce 2021-06-17  91  EXPORT_SYMBOL(memcpy);
211a65b2ee262f Matteo Croce 2021-06-17  92  
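
A likely fix for the error above (untested here) is to have __memmove call
the non-aliased helper directly instead of the weak memcpy symbol, e.g.:

	/* at line 100 in the report above, inside __memmove() */
	return __memcpy(dest, src, count);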

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 69804 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-06-30  4:42 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-17 15:27 [PATCH v3 0/3] riscv: optimized mem* functions Matteo Croce
2021-06-17 15:27 ` Matteo Croce
2021-06-17 15:27 ` [PATCH v3 1/3] riscv: optimized memcpy Matteo Croce
2021-06-17 15:27   ` Matteo Croce
2021-06-18 14:06   ` kernel test robot
2021-06-18 14:06     ` kernel test robot
2021-06-18 14:06     ` kernel test robot
2021-06-21 14:26   ` Christoph Hellwig
2021-06-21 14:26     ` Christoph Hellwig
2021-06-22  8:19     ` David Laight
2021-06-22  8:19       ` David Laight
2021-06-22 22:53       ` Matteo Croce
2021-06-22 22:53         ` Matteo Croce
2021-06-22 22:00     ` Matteo Croce
2021-06-22 22:00       ` Matteo Croce
2021-06-22  0:14   ` Nick Kossifidis
2021-06-22  0:14     ` Nick Kossifidis
2021-06-22 23:35     ` Matteo Croce
2021-06-22 23:35       ` Matteo Croce
2021-06-23  9:48       ` Nick Kossifidis
2021-06-23  9:48         ` Nick Kossifidis
2021-06-17 15:27 ` [PATCH v3 2/3] riscv: optimized memmove Matteo Croce
2021-06-17 15:27   ` Matteo Croce
2021-06-21 14:28   ` Christoph Hellwig
2021-06-21 14:28     ` Christoph Hellwig
2021-06-22  0:46   ` Nick Kossifidis
2021-06-22  0:46     ` Nick Kossifidis
2021-06-30  4:40   ` kernel test robot
2021-06-30  4:40     ` kernel test robot
2021-06-30  4:40     ` kernel test robot
2021-06-17 15:27 ` [PATCH v3 3/3] riscv: optimized memset Matteo Croce
2021-06-17 15:27   ` Matteo Croce
2021-06-21 14:32   ` Christoph Hellwig
2021-06-21 14:32     ` Christoph Hellwig
2021-06-22  1:07   ` Nick Kossifidis
2021-06-22  1:07     ` Nick Kossifidis
2021-06-22  8:38     ` David Laight
2021-06-22  8:38       ` David Laight
2021-06-23  1:14       ` Matteo Croce
2021-06-23  1:14         ` Matteo Croce
2021-06-23  9:05         ` David Laight
2021-06-23  9:05           ` David Laight
2021-06-23  0:08     ` Matteo Croce
2021-06-23  0:08       ` Matteo Croce
2021-06-22  1:09 ` [PATCH v3 0/3] riscv: optimized mem* functions Nick Kossifidis
2021-06-22  1:09   ` Nick Kossifidis
2021-06-22  2:39   ` Guo Ren
2021-06-22  2:39     ` Guo Ren

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.