* [PATCH 0/3] lib/string: optimized mem* functions
@ 2021-06-25  1:01 Matteo Croce
  2021-06-25  1:01 ` [PATCH 1/3] lib/string: optimized memcpy Matteo Croce
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Matteo Croce @ 2021-06-25  1:01 UTC (permalink / raw)
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic mem{cpy,move,set} so that memory is accessed with
the widest size possible, but without doing unaligned accesses.

This was originally posted as C string functions for RISC-V[1], but as
there was no specific RISC-V code, it was proposed for the generic
lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.

Further testing on big endian machines will be appreciated, as I don't
have such hardware at the moment.

[1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

Matteo Croce (3):
  lib/string: optimized memcpy
  lib/string: optimized memmove
  lib/string: optimized memset

 lib/string.c | 129 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 112 insertions(+), 17 deletions(-)

-- 
2.31.1



* [PATCH 1/3] lib/string: optimized memcpy
  2021-06-25  1:01 [PATCH 0/3] lib/string: optimized mem* functions Matteo Croce
@ 2021-06-25  1:01 ` Matteo Croce
  2021-06-25  5:01   ` kernel test robot
  2021-06-25 14:41   ` kernel test robot
  2021-06-25  1:01 ` [PATCH 2/3] lib/string: optimized memmove Matteo Croce
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: Matteo Croce @ 2021-06-25  1:01 UTC (permalink / raw)
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic memcpy() to copy a word at a time, without generating
unaligned accesses.

The procedure is made of three steps:
First, copy data one byte at a time until the destination buffer is
aligned on a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose an aligned long at every cycle.
Finally, copy the remainder one byte at a time.
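
To illustrate the shift-and-merge step, here is a minimal userspace
sketch (hypothetical byte values; 64-bit little endian; distance = 2;
the in-kernel code does the same thing through the MERGE_UL macro):

#include <stdio.h>
#include <stdint.h>

/* Little-endian MERGE_UL from the patch, with bytes_long = 8.
 * Valid for 0 < d < 8, matching the kernel's nonzero distance. */
#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((8 - (d)) * 8))

int main(void)
{
	/*
	 * Source bytes s0..s7 start 2 bytes past an alignment boundary,
	 * so the two aligned loads see (lowest-address byte first):
	 *   last = xx xx s0 s1 s2 s3 s4 s5
	 *   next = s6 s7 xx xx xx xx xx xx
	 */
	uint64_t last = 0x0504030201007777ULL;
	uint64_t next = 0x8888888888880706ULL;

	/* Prints 0706050403020100: eight consecutive source bytes
	 * packed into one word, ready for a single aligned store. */
	printf("%016llx\n", (unsigned long long)MERGE_UL(last, next, 2));
	return 0;
}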

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 3 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 546d59711a12..15e906f97d9e 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -33,6 +33,24 @@
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
+#define MIN_THRESHOLD (sizeof(long) * 2)
+
+/* Convenience union to avoid casts between different pointer types. */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+static const unsigned int bytes_long = sizeof(long);
+static const unsigned int word_mask = bytes_long - 1;
+
 #ifndef __HAVE_ARCH_STRNCASECMP
 /**
  * strncasecmp - Case insensitive, length-limited string comparison
@@ -878,16 +896,73 @@ EXPORT_SYMBOL(memset64);
  * You should not use this function to access IO space, use memcpy_toio()
  * or memcpy_fromio() instead.
  */
+
+#ifdef __BIG_ENDIAN
+#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((bytes_long - (d)) * 8))
+#else
+#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((bytes_long - (d)) * 8))
+#endif
+
 void *memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = dest;
-	const char *s = src;
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (count < MIN_THRESHOLD)
+			goto copy_remainder;
+
+		/* Copy a byte at a time until the destination is aligned. */
+		for (; d.as_uptr & word_mask; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & word_mask;
+	}
 
+	if (distance) {
+		unsigned long last, next;
+
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * a word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= bytes_long + word_mask; count -= bytes_long) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = MERGE_UL(last, next, distance);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= bytes_long; count -= bytes_long)
+			*d.as_ulong++ = *s.as_ulong++;
+	}
+
+copy_remainder:
 	while (count--)
-		*tmp++ = *s++;
+		*d.as_u8++ = *s.as_u8++;
+
 	return dest;
 }
 EXPORT_SYMBOL(memcpy);
+
+#undef MERGE_UL
+
 #endif
 
 #ifndef __HAVE_ARCH_MEMMOVE
-- 
2.31.1



* [PATCH 2/3] lib/string: optimized memmove
  2021-06-25  1:01 [PATCH 0/3] lib/string: optimized mem* functions Matteo Croce
  2021-06-25  1:01 ` [PATCH 1/3] lib/string: optimized memcpy Matteo Croce
@ 2021-06-25  1:01 ` Matteo Croce
  2021-06-25  1:02 ` [PATCH 3/3] lib/string: optimized memset Matteo Croce
  2021-06-25 17:45 ` [PATCH 0/3] lib/string: optimized mem* functions Nick Desaulniers
  3 siblings, 0 replies; 9+ messages in thread
From: Matteo Croce @ 2021-06-25  1:01 UTC (permalink / raw)
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest access size possible.

This "optimization" only covers a common case. In the future, proper
code which does the same thing as memcpy() does, but backwards, can be
added.
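
For instance, a small userspace sketch of the two cases (the buffer
contents are illustrative):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[16] = "abcdefgh";

	/* dest > src with overlap: a forward copy would read bytes it
	 * already overwrote, so this case must copy backwards. */
	memmove(buf + 2, buf, 6);
	printf("%s\n", buf);	/* ababcdef */

	/* dest < src: a forward copy is always safe, even with overlap,
	 * so this case can be handed to memcpy(). */
	memmove(buf, buf + 2, 6);
	printf("%s\n", buf);	/* abcdefef */

	return 0;
}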

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 15e906f97d9e..69adec252597 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -976,19 +976,13 @@ EXPORT_SYMBOL(memcpy);
  */
 void *memmove(void *dest, const void *src, size_t count)
 {
-	char *tmp;
-	const char *s;
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
 
-	if (dest <= src) {
-		tmp = dest;
-		s = src;
-		while (count--)
-			*tmp++ = *s++;
-	} else {
-		tmp = dest;
-		tmp += count;
-		s = src;
-		s += count;
 		while (count--)
 			*--tmp = *--s;
 	}
-- 
2.31.1



* [PATCH 3/3] lib/string: optimized memset
  2021-06-25  1:01 [PATCH 0/3] lib/string: optimized mem* functions Matteo Croce
  2021-06-25  1:01 ` [PATCH 1/3] lib/string: optimized memcpy Matteo Croce
  2021-06-25  1:01 ` [PATCH 2/3] lib/string: optimized memmove Matteo Croce
@ 2021-06-25  1:02 ` Matteo Croce
  2021-06-25 17:45 ` [PATCH 0/3] lib/string: optimized mem* functions Nick Desaulniers
  3 siblings, 0 replies; 9+ messages in thread
From: Matteo Croce @ 2021-06-25  1:02 UTC (permalink / raw)
  To: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini
  Cc: linux-arch, Andrew Morton, Nick Desaulniers, linux-riscv

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte-at-a-time write. This is always
safe, but it's slower than a 4-byte or even an 8-byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed, and
finally fills the remaining data one byte at a time.
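
The word-wise fill relies on replicating the fill byte across a long.
For example, with c = 0xab on a 64-bit machine (a standalone sketch of
the same shifts used in the patch):

#include <stdio.h>

int main(void)
{
	unsigned long cu = 0xab;	/* the fill byte */

	cu |= cu << 8;	/* 0x000000000000abab */
	cu |= cu << 16;	/* 0x00000000abababab */
	cu |= cu << 32;	/* 0xabababababababab (64-bit long only) */

	printf("%016lx\n", cu);
	return 0;
}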

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 69adec252597..598ece5434e9 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -811,10 +811,36 @@ EXPORT_SYMBOL(__sysfs_match_string);
  */
 void *memset(void *s, int c, size_t count)
 {
-	char *xs = s;
+	union types dest = { .as_u8 = s };
 
+	if (count >= MIN_THRESHOLD) {
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose a ulong with 'c' repeated 4/8 times */
+		cu |= cu << 8;
+		cu |= cu << 16;
+#if BITS_PER_LONG == 64
+		cu |= cu << 32;
+#endif
+
+		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+			/*
+			 * Fill the buffer one byte at a time until
+			 * the destination is word aligned.
+			 */
+			for (; count && dest.as_uptr & word_mask; count--)
+				*dest.as_u8++ = c;
+		}
+
+		/* Fill using the largest size allowed */
+		for (; count >= bytes_long; count -= bytes_long)
+			*dest.as_ulong++ = cu;
+	}
+
+	/* Fill the remainder */
 	while (count--)
-		*xs++ = c;
+		*dest.as_u8++ = c;
+
 	return s;
 }
 EXPORT_SYMBOL(memset);
-- 
2.31.1



* Re: [PATCH 1/3] lib/string: optimized memcpy
  2021-06-25  1:01 ` [PATCH 1/3] lib/string: optimized memcpy Matteo Croce
@ 2021-06-25  5:01   ` kernel test robot
  2021-06-25 14:41   ` kernel test robot
  1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2021-06-25  5:01 UTC (permalink / raw)
  To: Matteo Croce, linux-kernel, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini
  Cc: kbuild-all, linux-arch, Andrew Morton


Hi Matteo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on hch-configfs/for-next linus/master v5.13-rc7 next-20210624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matteo-Croce/lib-string-optimized-mem-functions/20210625-090352
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0c18f29aae7ce3dadd26d8ee3505d07cc982df75
config: powerpc64-randconfig-p001-20210622 (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e9af69e22b4b9fe629d72f5c8b022aca3244cc01
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matteo-Croce/lib-string-optimized-mem-functions/20210625-090352
        git checkout e9af69e22b4b9fe629d72f5c8b022aca3244cc01
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   lib/string.c:901: warning: Function parameter or member 'h' not described in 'MERGE_UL'
   lib/string.c:901: warning: Function parameter or member 'l' not described in 'MERGE_UL'
   lib/string.c:901: warning: Function parameter or member 'd' not described in 'MERGE_UL'
>> lib/string.c:901: warning: expecting prototype for memcpy(). Prototype was for MERGE_UL() instead


vim +901 lib/string.c

   888	
   889	#ifndef __HAVE_ARCH_MEMCPY
   890	/**
   891	 * memcpy - Copy one area of memory to another
   892	 * @dest: Where to copy to
   893	 * @src: Where to copy from
   894	 * @count: The size of the area.
   895	 *
   896	 * You should not use this function to access IO space, use memcpy_toio()
   897	 * or memcpy_fromio() instead.
   898	 */
   899	
   900	#ifdef __BIG_ENDIAN
 > 901	#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((bytes_long - (d)) * 8))
   902	#else
   903	#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((bytes_long - (d)) * 8))
   904	#endif
   905	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH 1/3] lib/string: optimized memcpy
  2021-06-25  1:01 ` [PATCH 1/3] lib/string: optimized memcpy Matteo Croce
  2021-06-25  5:01   ` kernel test robot
@ 2021-06-25 14:41   ` kernel test robot
  1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2021-06-25 14:41 UTC (permalink / raw)
  To: Matteo Croce, linux-kernel, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini
  Cc: kbuild-all, clang-built-linux, linux-arch, Andrew Morton


Hi Matteo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on hch-configfs/for-next linus/master v5.13-rc7 next-20210624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matteo-Croce/lib-string-optimized-mem-functions/20210625-090352
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0c18f29aae7ce3dadd26d8ee3505d07cc982df75
config: x86_64-randconfig-a002-20210622 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9ca0171a9ffdef5fdb1511d197a3fd72490362de)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/e9af69e22b4b9fe629d72f5c8b022aca3244cc01
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Matteo-Croce/lib-string-optimized-mem-functions/20210625-090352
        git checkout e9af69e22b4b9fe629d72f5c8b022aca3244cc01
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> lib/string.c:52:27: warning: unused variable 'word_mask' [-Wunused-const-variable]
   static const unsigned int word_mask = bytes_long - 1;
                             ^
   1 warning generated.
--
   lib/string.c:901: warning: Function parameter or member 'h' not described in 'MERGE_UL'
   lib/string.c:901: warning: Function parameter or member 'l' not described in 'MERGE_UL'
   lib/string.c:901: warning: Function parameter or member 'd' not described in 'MERGE_UL'
>> lib/string.c:901: warning: expecting prototype for memcpy(). Prototype was for MERGE_UL() instead


vim +/word_mask +52 lib/string.c

    50	
    51	static const unsigned int bytes_long = sizeof(long);
  > 52	static const unsigned int word_mask = bytes_long - 1;
    53	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH 0/3] lib/string: optimized mem* functions
  2021-06-25  1:01 [PATCH 0/3] lib/string: optimized mem* functions Matteo Croce
                   ` (2 preceding siblings ...)
  2021-06-25  1:02 ` [PATCH 3/3] lib/string: optimized memset Matteo Croce
@ 2021-06-25 17:45 ` Nick Desaulniers
  2021-06-27  0:19   ` Matteo Croce
  3 siblings, 1 reply; 9+ messages in thread
From: Nick Desaulniers @ 2021-06-25 17:45 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-kernel, Nick Kossifidis, Guo Ren, Christoph Hellwig,
	David Laight, Palmer Dabbelt, Emil Renner Berthing, Drew Fustini,
	linux-arch, Andrew Morton, linux-riscv

On Thu, Jun 24, 2021 at 6:02 PM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> the widest size possible, but without doing unaligned accesses.
>
> This was originally posted as C string functions for RISC-V[1], but as
> there was no specific RISC-V code, it was proposed for the generic
> lib/string.c implementation.
>
> Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> and HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> Further testing on big endian machines will be appreciated, as I don't
> have such hardware at the moment.

Hi Matteo,
Neat patches.  Do you have any benchmark data showing the claimed
improvements? Is it worthwhile to define these only when
CC_OPTIMIZE_FOR_PERFORMANCE/CC_OPTIMIZE_FOR_PERFORMANCE_O3 are
defined, not CC_OPTIMIZE_FOR_SIZE? I'd be curious to know the delta in
ST_SIZE (the ELF symbol size) of these functions otherwise.
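
For concreteness, the gating I have in mind is something like this
(an untested sketch; STRING_USE_WORD_AT_A_TIME is a made-up name):

/* lib/string.c -- pick the word-at-a-time versions only when
 * optimizing for performance; keep the smaller byte loops under -Os. */
#if defined(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) || \
    defined(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3)
#define STRING_USE_WORD_AT_A_TIME 1
#endif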

For big endian, you ought to be able to boot test in QEMU.  I think
you'd find out pretty quickly if any of the above had issues.
(Enabling KASAN is probably also a good idea for a test, too). Check
out
https://github.com/ClangBuiltLinux/boot-utils
for ready made images and scripts for launching various architectures
and endiannesses.

>
> [1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/
>
> Matteo Croce (3):
>   lib/string: optimized memcpy
>   lib/string: optimized memmove
>   lib/string: optimized memset
>
>  lib/string.c | 129 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 112 insertions(+), 17 deletions(-)
>
> --
> 2.31.1
>


--
Thanks,
~Nick Desaulniers


* Re: [PATCH 0/3] lib/string: optimized mem* functions
  2021-06-25 17:45 ` [PATCH 0/3] lib/string: optimized mem* functions Nick Desaulniers
@ 2021-06-27  0:19   ` Matteo Croce
  2021-07-01  0:29     ` Matteo Croce
  0 siblings, 1 reply; 9+ messages in thread
From: Matteo Croce @ 2021-06-27  0:19 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Linux Kernel Mailing List, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini, linux-arch, Andrew Morton,
	linux-riscv

On Fri, Jun 25, 2021 at 7:45 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Thu, Jun 24, 2021 at 6:02 PM Matteo Croce <mcroce@linux.microsoft.com> wrote:
> >
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > the widest size possible, but without doing unaligned accesses.
> >
> > This was originally posted as C string functions for RISC-V[1], but as
> > there was no specific RISC-V code, it was proposed for the generic
> > lib/string.c implementation.
> >
> > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > Further testing on big endian machines will be appreciated, as I don't
> > have such hardware at the moment.
>
> Hi Matteo,
> Neat patches.  Do you have any benchmark data showing the claimed
> improvements? Is it worthwhile to define these only when
> CC_OPTIMIZE_FOR_PERFORMANCE/CC_OPTIMIZE_FOR_PERFORMANCE_O3 are
> defined, not CC_OPTIMIZE_FOR_SIZE? I'd be curious to know the delta in
> ST_SIZE (the ELF symbol size) of these functions otherwise.
>

I compared the current versions with the new ones using bloat-o-meter;
the kernel grows by ~400 bytes on x86_64 and RISC-V:

x86_64

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function                                     old     new   delta
memcpy                                        29     351    +322
memset                                        29     117     +88
strlcat                                       68      78     +10
strlcpy                                       50      57      +7
memmove                                       56      50      -6
Total: Before=8556964, After=8557385, chg +0.00%

RISC-V

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/2 up/down: 432/-36 (396)
Function                                     old     new   delta
memcpy                                        36     324    +288
memset                                        32     148    +116
strlcpy                                      116     132     +16
strscpy_pad                                   84      96     +12
strlcat                                      176     164     -12
memmove                                       76      52     -24
Total: Before=1225371, After=1225767, chg +0.03%

I will post benchmarks made on a RISC-V machine which can't handle
unaligned accesses, and it will be the first user of the new
functions.

> For big endian, you ought to be able to boot test in QEMU.  I think
> you'd find out pretty quickly if any of the above had issues.
> (Enabling KASAN is probably also a good idea for a test, too). Check
> out
> https://github.com/ClangBuiltLinux/boot-utils
> for ready made images and scripts for launching various architectures
> and endiannesses.
>

Will do!

-- 
per aspera ad upstream


* Re: [PATCH 0/3] lib/string: optimized mem* functions
  2021-06-27  0:19   ` Matteo Croce
@ 2021-07-01  0:29     ` Matteo Croce
  0 siblings, 0 replies; 9+ messages in thread
From: Matteo Croce @ 2021-07-01  0:29 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Linux Kernel Mailing List, Nick Kossifidis, Guo Ren,
	Christoph Hellwig, David Laight, Palmer Dabbelt,
	Emil Renner Berthing, Drew Fustini, linux-arch, Andrew Morton,
	linux-riscv

On Sun, 27 Jun 2021 02:19:59 +0200
Matteo Croce <mcroce@linux.microsoft.com> wrote:

> On Fri, Jun 25, 2021 at 7:45 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Thu, Jun 24, 2021 at 6:02 PM Matteo Croce
> > <mcroce@linux.microsoft.com> wrote:
> > >
> > > From: Matteo Croce <mcroce@microsoft.com>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed
> > > with the widest size possible, but without doing unaligned
> > > accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1],
> > > but as there was no specific RISC-V code, it was proposed for the
> > > generic lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining
> > > __HAVE_ARCH_MEM{CPY,SET,MOVE} and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > Further testing on big endian machines will be appreciated, as I
> > > don't have such hardware at the moment.
> >
> > Hi Matteo,
> > Neat patches.  Do you have any benchmark data showing the
> > claimed improvements? Is it worthwhile to define these only when
> > CC_OPTIMIZE_FOR_PERFORMANCE/CC_OPTIMIZE_FOR_PERFORMANCE_O3 are
> > defined, not CC_OPTIMIZE_FOR_SIZE? I'd be curious to know the delta
> > in ST_SIZE (the ELF symbol size) of these functions otherwise.
> >
> 
> I compared the current versions with the new ones using bloat-o-meter;
> the kernel grows by ~400 bytes on x86_64 and RISC-V:
> 
> x86_64
> 
> $ scripts/bloat-o-meter vmlinux.orig vmlinux
> add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
> Function                                     old     new   delta
> memcpy                                        29     351    +322
> memset                                        29     117     +88
> strlcat                                       68      78     +10
> strlcpy                                       50      57      +7
> memmove                                       56      50      -6
> Total: Before=8556964, After=8557385, chg +0.00%
> 
> RISC-V
> 
> $ scripts/bloat-o-meter vmlinux.orig vmlinux
> add/remove: 0/0 grow/shrink: 4/2 up/down: 432/-36 (396)
> Function                                     old     new   delta
> memcpy                                        36     324    +288
> memset                                        32     148    +116
> strlcpy                                      116     132     +16
> strscpy_pad                                   84      96     +12
> strlcat                                      176     164     -12
> memmove                                       76      52     -24
> Total: Before=1225371, After=1225767, chg +0.03%
> 
> I will post benchmarks made on a RISC-V machine which can't handle
> unaligned accesses, and it will be the first user of the new
> functions.
> 
> > For big endian, you ought to be able to boot test in QEMU.  I think
> > you'd find out pretty quickly if any of the above had issues.
> > (Enabling KASAN is probably also a good idea for a test, too). Check
> > out
> > https://github.com/ClangBuiltLinux/boot-utils
> > for ready made images and scripts for launching various
> > architectures and endiannesses.
> >
> 
> Will do!
> 

I finally ran the benchmarks on RISC-V.

The current byte-at-a-time memcpy() always copies 74 Mb/s, no matter
the alignment of the buffers, while my implementation does 114 Mb/s on
two aligned buffers and 107 Mb/s when it needs to shift and merge words.

For memset(), the current byte-at-a-time implementation always writes
140 Mb/s, while my word-wise implementation always writes 241 Mb/s. Both
implementations have the same performance with aligned and unaligned
buffers.

The memcpy() test is a simple loop which calls memcpy() with different
offsets on a 32 MB buffer:

#define NUM_PAGES	(1 << (MAX_ORDER + 2))
#define PG_SIZE		(PAGE_SIZE * NUM_PAGES)

	/* declarations assumed for the snippet */
	struct page *page1, *page2;
	void *src, *dst;
	ktime_t t0, t1;
	int i, j;

	page1 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);
	page2 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);

	src = page_to_virt(page1);
	dst = page_to_virt(page2);

	for (i = 0; i < sizeof(void*); i++) {
		for (j = 0; j < sizeof(void*); j++) {
			t0 = ktime_get();
			memcpy(dst + i, src + j, PG_SIZE - max(i, j));
			t1 = ktime_get();
			printk("Strings selftest: memcpy(dst+%d, src+%d), distance %lu: %llu Mb/s\n",
			       i, j, (j - i) % sizeof(long),
			       PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
		}
		printk("\n");
	}

Similarly, the memset() one:

	/* declarations assumed for the snippet */
	struct page *page;
	void *dst;
	ktime_t t0, t1;
	int i;

	page = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);
	dst = page_to_virt(page);

	for (i = 0; i < sizeof(void*); i++) {
		t0 = ktime_get();
		memset(dst + i, 0, PG_SIZE - i);
		t1 = ktime_get();
		printk("Strings selftest: memset(dst+%d): %llu Mb/s\n",
		       i, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
	}

The results for the current one are:

[   27.893931] Strings selftest: memcpy testing with size: 32 Mb
[   28.315272] Strings selftest: memcpy(dst+0, src+0), distance 0: 75 Mb/s
[   28.736485] Strings selftest: memcpy(dst+0, src+1), distance 1: 75 Mb/s
[   29.156826] Strings selftest: memcpy(dst+0, src+2), distance 2: 76 Mb/s
[   29.576199] Strings selftest: memcpy(dst+0, src+3), distance 3: 76 Mb/s
[   29.994360] Strings selftest: memcpy(dst+0, src+4), distance 4: 76 Mb/s
[   30.411766] Strings selftest: memcpy(dst+0, src+5), distance 5: 76 Mb/s
[   30.828124] Strings selftest: memcpy(dst+0, src+6), distance 6: 76 Mb/s
[   31.243514] Strings selftest: memcpy(dst+0, src+7), distance 7: 76 Mb/s

[...]

[   52.077251] Strings selftest: memcpy(dst+7, src+0), distance 1: 74 Mb/s
[   52.508115] Strings selftest: memcpy(dst+7, src+1), distance 2: 74 Mb/s
[   52.939309] Strings selftest: memcpy(dst+7, src+2), distance 3: 74 Mb/s
[   53.370493] Strings selftest: memcpy(dst+7, src+3), distance 4: 74 Mb/s
[   53.801865] Strings selftest: memcpy(dst+7, src+4), distance 5: 74 Mb/s
[   54.233287] Strings selftest: memcpy(dst+7, src+5), distance 6: 74 Mb/s
[   54.664990] Strings selftest: memcpy(dst+7, src+6), distance 7: 74 Mb/s
[   55.086996] Strings selftest: memcpy(dst+7, src+7), distance 0: 75 Mb/s

[   55.109680] Strings selftest: memset testing with size: 32 Mb
[   55.337873] Strings selftest: memset(dst+0): 140 Mb/s
[   55.565905] Strings selftest: memset(dst+1): 140 Mb/s
[   55.793987] Strings selftest: memset(dst+2): 140 Mb/s
[   56.022140] Strings selftest: memset(dst+3): 140 Mb/s
[   56.250259] Strings selftest: memset(dst+4): 140 Mb/s
[   56.478283] Strings selftest: memset(dst+5): 140 Mb/s
[   56.706296] Strings selftest: memset(dst+6): 140 Mb/s
[   56.934335] Strings selftest: memset(dst+7): 140 Mb/s

While for the proposed one:

[   38.843970] Strings selftest: memcpy testing with size: 32 Mb
[   39.124047] Strings selftest: memcpy(dst+0, src+0), distance 0: 114 Mb/s
[   39.421848] Strings selftest: memcpy(dst+0, src+1), distance 1: 107 Mb/s
[   39.719613] Strings selftest: memcpy(dst+0, src+2), distance 2: 107 Mb/s
[   40.017310] Strings selftest: memcpy(dst+0, src+3), distance 3: 107 Mb/s
[   40.314939] Strings selftest: memcpy(dst+0, src+4), distance 4: 107 Mb/s
[   40.612485] Strings selftest: memcpy(dst+0, src+5), distance 5: 107 Mb/s
[   40.910054] Strings selftest: memcpy(dst+0, src+6), distance 6: 107 Mb/s
[   41.207577] Strings selftest: memcpy(dst+0, src+7), distance 7: 107 Mb/s

[...]

[   55.682517] Strings selftest: memcpy(dst+7, src+0), distance 1: 107 Mb/s
[   55.980262] Strings selftest: memcpy(dst+7, src+1), distance 2: 107 Mb/s
[   56.277862] Strings selftest: memcpy(dst+7, src+2), distance 3: 107 Mb/s
[   56.575514] Strings selftest: memcpy(dst+7, src+3), distance 4: 107 Mb/s
[   56.873142] Strings selftest: memcpy(dst+7, src+4), distance 5: 107 Mb/s
[   57.170840] Strings selftest: memcpy(dst+7, src+5), distance 6: 107 Mb/s
[   57.468553] Strings selftest: memcpy(dst+7, src+6), distance 7: 107 Mb/s
[   57.748231] Strings selftest: memcpy(dst+7, src+7), distance 0: 114 Mb/s

[   57.772721] Strings selftest: memset testing with size: 32 Mb
[   57.905358] Strings selftest: memset(dst+0): 241 Mb/s
[   58.037974] Strings selftest: memset(dst+1): 241 Mb/s
[   58.170619] Strings selftest: memset(dst+2): 241 Mb/s
[   58.303228] Strings selftest: memset(dst+3): 241 Mb/s
[   58.435808] Strings selftest: memset(dst+4): 241 Mb/s
[   58.568373] Strings selftest: memset(dst+5): 241 Mb/s
[   58.700968] Strings selftest: memset(dst+6): 241 Mb/s
[   58.833499] Strings selftest: memset(dst+7): 241 Mb/s

Anyway, I have to submit a v2, because the current code fails to build
on compilers which don't treat bytes_long as a compile-time constant:

lib/string.c:52:39: error: initializer element is not constant
 static const unsigned int word_mask = bytes_long - 1;
                                       ^~~~~~~~~~
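
A likely fix (a sketch of what I have in mind for v2, not the posted
code) is to let the preprocessor provide the constants instead of
static const variables:

/* Object-like macros are integer constant expressions, which every
 * compiler accepts in a static initializer. */
#define BYTES_LONG	sizeof(long)
#define WORD_MASK	(BYTES_LONG - 1)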

Regards,
-- 
per aspera ad upstream

