* [PATCH v2 0/3] lib/string: optimized mem* functions
From: Matteo Croce @ 2021-07-02 12:31 UTC
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic mem{cpy,move,set} so that memory is accessed with
the widest size possible, but without doing unaligned accesses.

This was originally posted as C string functions for RISC-V[1], but since
there was nothing RISC-V specific in the code, it was proposed for the
generic lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.

This is the performance of memcpy() and memset() on a RISC-V machine,
measured on a 32 MB buffer:

memcpy:
original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

memset:
original aligned:	140 Mb/s
original unaligned:	140 Mb/s
new aligned:		241 Mb/s
new unaligned:		241 Mb/s
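
For reference, this kind of throughput number can be reproduced from
userspace with a trivial loop like the one below. It exercises the C
library memcpy(), so it only illustrates the measurement method rather
than reproducing the figures above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (32UL << 20)	/* 32 MB, as in the figures above */
#define ITERS    64

int main(void)
{
	char *src = malloc(BUF_SIZE + 1);
	char *dst = malloc(BUF_SIZE + 1);
	struct timespec a, b;
	double secs;
	int i;

	if (!src || !dst)
		return 1;

	memset(src, 0x5a, BUF_SIZE + 1);

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src + 1, BUF_SIZE);	/* +1: unaligned source */
	clock_gettime(CLOCK_MONOTONIC, &b);

	secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", ITERS * (BUF_SIZE >> 20) / secs);
	return 0;
}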

The size increase is negligible:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function                                     old     new   delta
memcpy                                        29     351    +322
memset                                        29     117     +88
strlcat                                       68      78     +10
strlcpy                                       50      57      +7
memmove                                       56      50      -6
Total: Before=8556964, After=8557385, chg +0.00%

These functions will be used for RISC-V initially.

[1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

Matteo Croce (3):
  lib/string: optimized memcpy
  lib/string: optimized memmove
  lib/string: optimized memset

 lib/string.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 113 insertions(+), 17 deletions(-)

-- 
2.31.1


* [PATCH v2 1/3] lib/string: optimized memcpy
From: Matteo Croce @ 2021-07-02 12:31 UTC
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic memcpy() to copy a word at a time, without generating
unaligned accesses.

The procedure consists of three steps:
First, copy the data one byte at a time until the destination buffer is
aligned to a long boundary.
Then, copy the data one long at a time, shifting the current and the next
long so as to compose an aligned long on every iteration.
Finally, copy the remainder one byte at a time.
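
To make the shifting step concrete, here is a small standalone
illustration (assuming a 64-bit little-endian machine and a source that
is 3 bytes ahead of the already aligned destination); it performs the
same merge as the MERGE_UL() macro introduced below:

#include <stdio.h>

int main(void)
{
	unsigned long last = 0x0807060504030201UL; /* first aligned src word */
	unsigned long next = 0x100f0e0d0c0b0a09UL; /* next aligned src word  */
	int distance = 3;			   /* src offset in the word */

	/* Little-endian MERGE_UL(): combine the high bytes of 'last' with
	 * the low bytes of 'next' to build one full destination word. */
	unsigned long out = last >> (distance * 8) |
			    next << ((8 - distance) * 8);

	printf("%#lx\n", out);	/* 0xb0a090807060504 */
	return 0;
}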

This is the improvement on RISC-V:

original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

and this the binary size increase according to bloat-o-meter:

Function     old     new   delta
memcpy        36     324    +288


Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 546d59711a12..caeef4264c43 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -33,6 +33,23 @@
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
 #ifndef __HAVE_ARCH_STRNCASECMP
 /**
  * strncasecmp - Case insensitive, length-limited string comparison
@@ -869,6 +886,13 @@ EXPORT_SYMBOL(memset64);
 #endif
 
 #ifndef __HAVE_ARCH_MEMCPY
+
+#ifdef __BIG_ENDIAN
+#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((BYTES_LONG - (d)) * 8))
+#else
+#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8))
+#endif
+
 /**
  * memcpy - Copy one area of memory to another
  * @dest: Where to copy to
@@ -880,14 +904,64 @@ EXPORT_SYMBOL(memset64);
  */
 void *memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = dest;
-	const char *s = src;
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (count < MIN_THRESHOLD)
+			goto copy_remainder;
+
+		/* Copy a byte at time until destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
 
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = MERGE_UL(last, next, distance);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*d.as_ulong++ = *s.as_ulong++;
+	}
+
+copy_remainder:
 	while (count--)
-		*tmp++ = *s++;
+		*d.as_u8++ = *s.as_u8++;
+
 	return dest;
 }
 EXPORT_SYMBOL(memcpy);
+
+#undef MERGE_UL
+
 #endif
 
 #ifndef __HAVE_ARCH_MEMMOVE
-- 
2.31.1


* [PATCH v2 2/3] lib/string: optimized memmove
From: Matteo Croce @ 2021-07-02 12:31 UTC
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap at all, it's safe to use memcpy() instead, which is
optimized to use the widest access size possible.

This "optimization" only covers a common case. In the future, proper code
that does the same thing as memcpy() but copies backwards can be added.
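
Such a backwards counterpart could eventually look something like the
rough, untested sketch below (reusing BYTES_LONG/WORD_MASK from the
first patch, and taking the word-at-a-time path only when both buffer
ends share the same alignment):

static void *backwards_copy(void *dest, const void *src, size_t count)
{
	unsigned char *d = (unsigned char *)dest + count;
	const unsigned char *s = (const unsigned char *)src + count;

	/* Copy whole words starting from the end, so overlapping data is
	 * read before it gets overwritten. */
	if (!(((uintptr_t)d | (uintptr_t)s) & WORD_MASK)) {
		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
			d -= BYTES_LONG;
			s -= BYTES_LONG;
			*(unsigned long *)d = *(const unsigned long *)s;
		}
	}

	/* Copy the remaining head one byte at a time. */
	while (count--)
		*--d = *--s;

	return dest;
}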

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index caeef4264c43..108b83c34cec 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -975,19 +975,13 @@ EXPORT_SYMBOL(memcpy);
  */
 void *memmove(void *dest, const void *src, size_t count)
 {
-	char *tmp;
-	const char *s;
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
 
-	if (dest <= src) {
-		tmp = dest;
-		s = src;
-		while (count--)
-			*tmp++ = *s++;
-	} else {
-		tmp = dest;
-		tmp += count;
-		s = src;
-		s += count;
 		while (count--)
 			*--tmp = *--s;
 	}
-- 
2.31.1


* [PATCH v2 3/3] lib/string: optimized memset
From: Matteo Croce @ 2021-07-02 12:31 UTC
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte-at-a-time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at a time until the
destination is word aligned, then fills using the largest size allowed,
and finally fills the remaining data one byte at a time.
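
For example, with c = 0xab on a 64-bit machine the fill word is built up
like this (a standalone illustration of the same logic used below, shift
variant and fast-multiplier variant side by side):

#include <stdio.h>

int main(void)
{
	unsigned long cu = 0xab;	/* the fill byte, zero-extended */

	cu |= cu << 8;			/* 0x000000000000abab */
	cu |= cu << 16;			/* 0x00000000abababab */
	cu |= (cu << 16) << 16;		/* 0xabababababababab */

	/* Same result with a single multiplication: */
	printf("%#lx %#lx\n", cu, 0xabUL * 0x0101010101010101UL);
	return 0;
}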

On a RISC-V machine the speed goes from 140 Mb/s to 241 Mb/s,
and this is the binary size increase according to bloat-o-meter:

Function     old     new   delta
memset        32     148    +116

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 108b83c34cec..264821f0e795 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -810,10 +810,38 @@ EXPORT_SYMBOL(__sysfs_match_string);
  */
 void *memset(void *s, int c, size_t count)
 {
-	char *xs = s;
+	union types dest = { .as_u8 = s };
 
+	if (count >= MIN_THRESHOLD) {
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+		cu *= 0x0101010101010101UL;
+#else
+		cu |= cu << 8;
+		cu |= cu << 16;
+		/* Suppress warning on 32 bit machines */
+		cu |= (cu << 16) << 16;
+#endif
+		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+			/*
+			 * Fill the buffer one byte at time until
+			 * the destination is word aligned.
+			 */
+			for (; count && dest.as_uptr & WORD_MASK; count--)
+				*dest.as_u8++ = c;
+		}
+
+		/* Copy using the largest size allowed */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*dest.as_ulong++ = cu;
+	}
+
+	/* copy the remainder */
 	while (count--)
-		*xs++ = c;
+		*dest.as_u8++ = c;
+
 	return s;
 }
 EXPORT_SYMBOL(memset);
-- 
2.31.1


* Re: [PATCH v2 1/3] lib/string: optimized memcpy
From: Ben Dooks @ 2021-07-02 14:37 UTC
  To: Matteo Croce, linux-kernel, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

On 02/07/2021 13:31, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Rewrite the generic memcpy() to copy a word at time, without generating
> unaligned accesses.
> 
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next long
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> 
> This is the improvement on RISC-V:
> 
> original aligned:	 75 Mb/s
> original unaligned:	 75 Mb/s
> new aligned:		114 Mb/s
> new unaligned:		107 Mb/s
> 
> and this the binary size increase according to bloat-o-meter:
> 
> Function     old     new   delta
> memcpy        36     324    +288
> 
> 
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>   lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 77 insertions(+), 3 deletions(-)

Doesn't arch/riscv/lib/memcpy.S also exist as an architecture-optimised
version? I would have thought the lib/string.c version was not being
used?


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html

* Re: [PATCH v2 1/3] lib/string: optimized memcpy
From: Matteo Croce @ 2021-07-02 14:44 UTC
  To: Ben Dooks
  Cc: Linux Kernel Mailing List, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini, linux-arch, Andrew Morton,
	Nick Desaulniers, linux-riscv

On Fri, Jul 2, 2021 at 4:37 PM Ben Dooks <ben.dooks@codethink.co.uk> wrote:
>
> On 02/07/2021 13:31, Matteo Croce wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic memcpy() to copy a word at time, without generating
> > unaligned accesses.
> >
> > The procedure is made of three steps:
> > First copy data one byte at time until the destination buffer is aligned
> > to a long boundary.
> > Then copy the data one long at time shifting the current and the next long
> > to compose a long at every cycle.
> > Finally, copy the remainder one byte at time.
> >
> > This is the improvement on RISC-V:
> >
> > original aligned:      75 Mb/s
> > original unaligned:    75 Mb/s
> > new aligned:          114 Mb/s
> > new unaligned:                107 Mb/s
> >
> > and this the binary size increase according to bloat-o-meter:
> >
> > Function     old     new   delta
> > memcpy        36     324    +288
> >
> >
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > ---
> >   lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 77 insertions(+), 3 deletions(-)
>
> Doesn't arch/riscv/lib/memcpy.S also exist for an architecture
> optimised version? I would have thought the lib/string.c version
> was not being used?
>
>

Yes, but this series started as a C replacement for that assembly
version, which generates unaligned accesses.
Unfortunately the existing RISC-V processors can't handle unaligned
accesses in hardware, so they are emulated, with a terrible slowdown.
Then, since there wasn't any riscv-specific code, it was proposed as
generic code:

Discussion: https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

-- 
per aspera ad upstream

* Re: [PATCH v2 0/3] lib/string: optimized mem* functions
From: Andrew Morton @ 2021-07-10 21:31 UTC
  To: Matteo Croce
  Cc: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini,
	linux-arch, Nick Desaulniers, linux-riscv

On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:

> From: Matteo Croce <mcroce@microsoft.com>
> 
> Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> the widest size possible, but without doing unaligned accesses.
> 
> This was originally posted as C string functions for RISC-V[1], but as
> there was no specific RISC-V code, it was proposed for the generic
> lib/string.c implementation.
> 
> Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> 
> These are the performances of memcpy() and memset() of a RISC-V machine
> on a 32 mbyte buffer:
> 
> memcpy:
> original aligned:	 75 Mb/s
> original unaligned:	 75 Mb/s
> new aligned:		114 Mb/s
> new unaligned:		107 Mb/s
> 
> memset:
> original aligned:	140 Mb/s
> original unaligned:	140 Mb/s
> new aligned:		241 Mb/s
> new unaligned:		241 Mb/s

Did you record the x86_64 performance?


Which other architectures are affected by this change?

* Re: [PATCH v2 0/3] lib/string: optimized mem* functions
From: Matteo Croce @ 2021-07-10 23:07 UTC
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini, linux-arch, Nick Desaulniers,
	linux-riscv

On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > the widest size possible, but without doing unaligned accesses.
> >
> > This was originally posted as C string functions for RISC-V[1], but as
> > there was no specific RISC-V code, it was proposed for the generic
> > lib/string.c implementation.
> >
> > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > These are the performances of memcpy() and memset() of a RISC-V machine
> > on a 32 mbyte buffer:
> >
> > memcpy:
> > original aligned:      75 Mb/s
> > original unaligned:    75 Mb/s
> > new aligned:          114 Mb/s
> > new unaligned:                107 Mb/s
> >
> > memset:
> > original aligned:     140 Mb/s
> > original unaligned:   140 Mb/s
> > new aligned:          241 Mb/s
> > new unaligned:                241 Mb/s
>
> Did you record the x86_64 performance?
>
>
> Which other architectures are affected by this change?

x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
and has optimized implementations in arch/x86/lib.
Anyway, I was curious and tested them on x86_64 too: there was zero
gain over the generic ones.

The only architecture which will use all three functions is riscv,
while memmove() will be used by arc, h8300, hexagon, ia64, openrisc
and parisc.
Keep in mind that memmove() isn't anything special: it just calls
memcpy() when possible (e.g. buffers not overlapping), and falls back
to the byte-by-byte copy otherwise.

In the future we can write two functions, one which copies forward and
another which copies backward, and call the right one depending on
the buffers' positions.
Then, we could alias memcpy() and memmove(), as proposed by Linus:

https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132
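
Roughly along these lines (just a sketch of the aliasing idea, not code
from this series):

/* memmove() always handles overlap, and memcpy() simply becomes another
 * name for it via a GCC symbol alias; the alias must live in the same
 * translation unit that defines memmove(). */
void *memmove(void *dest, const void *src, size_t count);

void *memcpy(void *dest, const void *src, size_t count)
	__attribute__((alias("memmove")));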

Regards,
-- 
per aspera ad upstream

* RE: [PATCH v2 0/3] lib/string: optimized mem* functions
From: David Laight @ 2021-07-12  8:15 UTC
  To: 'Matteo Croce', Andrew Morton
  Cc: Linux Kernel Mailing List, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, Palmer Dabbelt, Emil Renner Berthing,
	Drew Fustini, linux-arch, Nick Desaulniers, linux-riscv

From: Matteo Croce
> Sent: 11 July 2021 00:08
> 
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
> >
> > > From: Matteo Croce <mcroce@microsoft.com>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > These are the performances of memcpy() and memset() of a RISC-V machine
> > > on a 32 mbyte buffer:
> > >
> > > memcpy:
> > > original aligned:      75 Mb/s
> > > original unaligned:    75 Mb/s
> > > new aligned:          114 Mb/s
> > > new unaligned:                107 Mb/s
> > >
> > > memset:
> > > original aligned:     140 Mb/s
> > > original unaligned:   140 Mb/s
> > > new aligned:          241 Mb/s
> > > new unaligned:                241 Mb/s
> >
> > Did you record the x86_64 performance?
> >
> >
> > Which other architectures are affected by this change?
> 
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and I tested them on x86_64 too, there was zero
> gain over the generic ones.

x86 performance (and attainable performance) does depend on the CPU
micro-architecture.

Any recent 'desktop' Intel CPU will almost certainly manage to
re-order the execution of almost any copy loop and attain 1 write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)

The same isn't true of the Atom-based CPUs that may be found in small
servers. These are no slouches (e.g. 4 cores at 2.4GHz) but they only
have limited out-of-order execution and so are much more sensitive to
instruction ordering.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
