linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] x86: Optimize memchr() for x86-64
@ 2022-05-28  8:12 Yu-Jen Chang
  2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Yu-Jen Chang @ 2022-05-28  8:12 UTC (permalink / raw)
  To: ak, jdike
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, keescook, linux-kernel,
	linux-hardening, richard, anton.ivanov, johannes, linux-um,
	jserv, Yu-Jen Chang

This patch series adds an optimized memchr() for x86-64 and
User-Mode Linux (UML).

There is an assembly implementation for x86-32, but there is no
optimized version for x86-64. We implement word-wise comparison so
that 8 characters can be compared at the same time on x86-64 CPUs.
The optimized memchr() is nearly 4x faster than the original
implementation for long strings.

We tested the optimized memchr() in UML and also rebuilt the 5.18
kernel with it. Both run correctly.

In this series we add a new file, string_64.c, which currently
contains only memchr(). More optimized string functions can be added
to it in the future.

Yu-Jen Chang (2):
  x86/lib: Optimize memchr()
  x86/um: Use x86_64-optimized memchr

 arch/x86/include/asm/string_64.h |  3 ++
 arch/x86/lib/Makefile            |  1 +
 arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
 arch/x86/um/Makefile             |  2 +-
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/lib/string_64.c

-- 
2.25.1



* [PATCH 1/2] x86/lib: Optimize memchr()
  2022-05-28  8:12 [PATCH 0/2] x86: Optimize memchr() for x86-64 Yu-Jen Chang
@ 2022-05-28  8:12 ` Yu-Jen Chang
  2022-05-28 16:41   ` Tao Zhou
  2022-05-30  8:09   ` David Laight
  2022-05-28  8:12 ` [PATCH 2/2] x86/um: Use x86_64-optimized memchr Yu-Jen Chang
  2022-05-29  1:10 ` [PATCH 0/2] x86: Optimize memchr() for x86-64 Andi Kleen
  2 siblings, 2 replies; 10+ messages in thread
From: Yu-Jen Chang @ 2022-05-28  8:12 UTC (permalink / raw)
  To: ak, jdike
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, keescook, linux-kernel,
	linux-hardening, richard, anton.ivanov, johannes, linux-um,
	jserv, Yu-Jen Chang

The original assembly version of memchr() is implemented with a
byte-wise comparison technique, which does not make full use of the
64-bit registers on x86_64 CPUs. We use word-wise comparison so that
8 characters can be compared at the same time. First we align the
input, then use word-wise comparison to find the first 64-bit word
that contains the target. Finally, we compare every byte in that
word to locate the exact position.

We created two files to measure the performance. The first file
contains on average 10 characters ahead of the target character.
The second file contains at least 1000 characters ahead of the
target character. Our implementation of memchr() is slightly
better in the first test and nearly 4x faster than the original
implementation in the second test.

Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
---
 arch/x86/include/asm/string_64.h |  3 ++
 arch/x86/lib/Makefile            |  1 +
 arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+)
 create mode 100644 arch/x86/lib/string_64.c

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 6e450827f..edce657e0 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -14,6 +14,9 @@
 extern void *memcpy(void *to, const void *from, size_t len);
 extern void *__memcpy(void *to, const void *from, size_t len);
 
+#define __HAVE_ARCH_MEMCHR
+extern void *memchr(const void *cs, int c, size_t length);
+
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
 void *__memset(void *s, int c, size_t n);
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f76747862..4d530e559 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -69,5 +69,6 @@ else
         lib-y += clear_page_64.o copy_page_64.o
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o
+        lib-y += string_64.o
 	lib-y += cmpxchg16b_emu.o
 endif
diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
new file mode 100644
index 000000000..4e067d5be
--- /dev/null
+++ b/arch/x86/lib/string_64.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/string.h>
+#include <linux/export.h>
+#include <linux/align.h>
+
+/* How many bytes are loaded each iteration of the word copy loop */
+#define LBLOCKSIZE (sizeof(long))
+
+#ifdef __HAVE_ARCH_MEMCHR
+
+void *memchr(const void *cs, int c, size_t length)
+{
+	const unsigned char *src = (const unsigned char *)cs, d = c;
+
+	while (!IS_ALIGNED((long)src, sizeof(long))) {
+		if (!length--)
+			return NULL;
+		if (*src == d)
+			return (void *)src;
+		src++;
+	}
+	if (length >= LBLOCKSIZE) {
+		unsigned long mask = d << 8 | d;
+		unsigned int i = 32;
+		long xor, data;
+		const long consta = 0xFEFEFEFEFEFEFEFF,
+			   constb = 0x8080808080808080;
+
+		/*
+		 * Create a 8-bytes mask for word-wise comparing.
+		 * For example, a mask for 'a' is 0x6161616161616161.
+		 */
+
+		mask |= mask << 16;
+		for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
+			mask |= mask << i;
+		/*
+		 * We perform word-wise comparing with following operation:
+		 *	1. Perform xor on the long word @src and @mask
+		 *	   and put into @xor.
+		 *	2. Add @xor with @consta.
+		 *	3. ~@xor & @constb.
+		 *	4. Perform & with the result of step 2 and 3.
+		 *
+		 * Step 1 creates a byte which is 0 in the long word if
+		 * there is at least one target byte in it.
+		 *
+		 * Step 2 to Step 4 find if there is a byte with 0 in
+		 * the long word.
+		 */
+		asm volatile("1:\n\t"
+			     "movq (%0),%1\n\t"
+			     "xorq %6,%1\n\t"
+			     "lea (%1,%4), %2\n\t"
+			     "notq %1\n\t"
+			     "andq %5,%1\n\t"
+			     "testq %1,%2\n\t"
+			     "jne 2f\n\t"
+			     "add $8,%0\n\t"
+			     "sub $8,%3\n\t"
+			     "cmp $7,%3\n\t"
+			     "ja 1b\n\t"
+			     "2:\n\t"
+			     : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
+			     : "r"(consta), "r"(constb), "r"(mask), "0"(src),
+			       "1"(xor), "2"(data), "3"(length)
+			     : "memory", "cc");
+	}
+
+	while (length--) {
+		if (*src == d)
+			return (void *)src;
+		src++;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(memchr);
+#endif
-- 
2.25.1



* [PATCH 2/2] x86/um: Use x86_64-optimized memchr
  2022-05-28  8:12 [PATCH 0/2] x86: Optimize memchr() for x86-64 Yu-Jen Chang
  2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
@ 2022-05-28  8:12 ` Yu-Jen Chang
  2022-05-29  1:10 ` [PATCH 0/2] x86: Optimize memchr() for x86-64 Andi Kleen
  2 siblings, 0 replies; 10+ messages in thread
From: Yu-Jen Chang @ 2022-05-28  8:12 UTC (permalink / raw)
  To: ak, jdike
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, keescook, linux-kernel,
	linux-hardening, richard, anton.ivanov, johannes, linux-um,
	jserv, Yu-Jen Chang

Add the x86_64-optimized memchr(), which is nearly 4x faster
than the original implementation, to UML.

Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
---
 arch/x86/um/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index ba5789c35..52b7c21ca 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -28,7 +28,7 @@ else
 
 obj-y += syscalls_64.o vdso/
 
-subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o ../entry/thunk_64.o
+subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o ../lib/string_64.o ../entry/thunk_64.o
 
 endif
 
-- 
2.25.1



* Re: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
@ 2022-05-28 16:41   ` Tao Zhou
  2022-05-29 12:05     ` arthur chang arthur
  2022-05-30  8:09   ` David Laight
  1 sibling, 1 reply; 10+ messages in thread
From: Tao Zhou @ 2022-05-28 16:41 UTC (permalink / raw)
  To: Yu-Jen Chang, tao.zhou
  Cc: ak, jdike, tglx, mingo, bp, dave.hansen, x86, hpa, keescook,
	linux-kernel, linux-hardening, richard, anton.ivanov, johannes,
	linux-um, jserv

On Sat, May 28, 2022 at 04:12:35PM +0800, Yu-Jen Chang wrote:

> The original assembly version of memchr() is implemented with
> the byte-wise comparing technique, which does not fully
> use 64-bits registers in x86_64 CPU. We use word-wide
> comparing so that 8 characters can be compared at the same time
> on x86_64 CPU. First we align the input and then use word-wise
> comparing to find the first 64-bit word that contain the target.
> Secondly, we compare every byte in the word and get the output.
> 
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.
> 
> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> ---
>  arch/x86/include/asm/string_64.h |  3 ++
>  arch/x86/lib/Makefile            |  1 +
>  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+)
>  create mode 100644 arch/x86/lib/string_64.c
> 
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 6e450827f..edce657e0 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -14,6 +14,9 @@
>  extern void *memcpy(void *to, const void *from, size_t len);
>  extern void *__memcpy(void *to, const void *from, size_t len);
>  
> +#define __HAVE_ARCH_MEMCHR
> +extern void *memchr(const void *cs, int c, size_t length);
> +
>  #define __HAVE_ARCH_MEMSET
>  void *memset(void *s, int c, size_t n);
>  void *__memset(void *s, int c, size_t n);
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index f76747862..4d530e559 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -69,5 +69,6 @@ else
>          lib-y += clear_page_64.o copy_page_64.o
>          lib-y += memmove_64.o memset_64.o
>          lib-y += copy_user_64.o
> +        lib-y += string_64.o
>  	lib-y += cmpxchg16b_emu.o
>  endif
> diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> new file mode 100644
> index 000000000..4e067d5be
> --- /dev/null
> +++ b/arch/x86/lib/string_64.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/string.h>
> +#include <linux/export.h>
> +#include <linux/align.h>
> +
> +/* How many bytes are loaded each iteration of the word copy loop */
> +#define LBLOCKSIZE (sizeof(long))
> +
> +#ifdef __HAVE_ARCH_MEMCHR
> +
> +void *memchr(const void *cs, int c, size_t length)
> +{
> +	const unsigned char *src = (const unsigned char *)cs, d = c;

I don't know why this 'd = c' is not an error.
d is a char pointer and c is an int. At least it does not look safe to me.

> +	while (!IS_ALIGNED((long)src, sizeof(long))) {
> +		if (!length--)
> +			return NULL;
> +		if (*src == d)

This compares a character value to a pointer value, and that value is c.
It may be right, I do not know.

Or:

char d = c;
...

> +			return (void *)src;
> +		src++;
> +	}
> +	if (length >= LBLOCKSIZE) {
> +		unsigned long mask = d << 8 | d;
> +		unsigned int i = 32;
> +		long xor, data;
> +		const long consta = 0xFEFEFEFEFEFEFEFF,
> +			   constb = 0x8080808080808080;

Two magic numbers..

> +		/*
> +		 * Create a 8-bytes mask for word-wise comparing.
> +		 * For example, a mask for 'a' is 0x6161616161616161.
> +		 */
> +
> +		mask |= mask << 16;
> +		for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> +			mask |= mask << i;
> +		/*
> +		 * We perform word-wise comparing with following operation:
> +		 *	1. Perform xor on the long word @src and @mask
> +		 *	   and put into @xor.
> +		 *	2. Add @xor with @consta.
> +		 *	3. ~@xor & @constb.
> +		 *	4. Perform & with the result of step 2 and 3.
> +		 *
> +		 * Step 1 creates a byte which is 0 in the long word if
> +		 * there is at least one target byte in it.
> +		 *
> +		 * Step 2 to Step 4 find if there is a byte with 0 in
> +		 * the long word.
> +		 */
> +		asm volatile("1:\n\t"
> +			     "movq (%0),%1\n\t"
> +			     "xorq %6,%1\n\t"
> +			     "lea (%1,%4), %2\n\t"
> +			     "notq %1\n\t"
> +			     "andq %5,%1\n\t"
> +			     "testq %1,%2\n\t"
> +			     "jne 2f\n\t"

s/jne/jnz/

I may be missing a lot here, but I give an example below suggesting that we
should check whether the CF flag is zero.

1) contains a matching byte.

1111111011111111(consta)
                        +add
0000000001101100(xor)
------------------------
1111111101101011 (%1)


1111111110010100(~xor)
                        &and
1000000010000000(constb)
------------------------
1000000010000000 (%2)


The logical AND of %1 and %2 is
1000000000000000, which is not zero.

2) does not contain a matching byte

1111111011111111
                 +
0110111011011100
----------------
0110110111011011(%1)

1001000100100011
                 &
1000000010000000
----------------
1000000000000000(%2)

The logical AND of %1 and %2 is
0000000000000000, which is zero.

I guess the jump instruction jnz should be used here instead.
Even so, I do not know why those two magic numbers are so magical..

Thanks,
Tao
> +			     "add $8,%0\n\t"
> +			     "sub $8,%3\n\t"
> +			     "cmp $7,%3\n\t"
> +			     "ja 1b\n\t"
> +			     "2:\n\t"
> +			     : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
> +			     : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> +			       "1"(xor), "2"(data), "3"(length)
> +			     : "memory", "cc");
> +	}
> +
> +	while (length--) {
> +		if (*src == d)
> +			return (void *)src;
> +		src++;
> +	}
> +	return NULL;
> +}
> +EXPORT_SYMBOL(memchr);
> +#endif
> -- 
> 2.25.1
> 


* Re: [PATCH 0/2] x86: Optimize memchr() for x86-64
  2022-05-28  8:12 [PATCH 0/2] x86: Optimize memchr() for x86-64 Yu-Jen Chang
  2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
  2022-05-28  8:12 ` [PATCH 2/2] x86/um: Use x86_64-optimized memchr Yu-Jen Chang
@ 2022-05-29  1:10 ` Andi Kleen
  2 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2022-05-29  1:10 UTC (permalink / raw)
  To: Yu-Jen Chang, jdike
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, keescook, linux-kernel,
	linux-hardening, richard, anton.ivanov, johannes, linux-um,
	jserv


On 5/28/2022 1:12 AM, Yu-Jen Chang wrote:
> *** BLURB HERE ***
> These patch series add an optimized "memchr()" for x86-64 and
> USER-MODE LINUX (UML).
>   
> There exists an assemebly implementation for x86-32. However,
> for x86-64, there isn't any optimized version. We implement word-wise
> comparison so that 8 characters can be compared at the same time on
> x86-64 CPU. The optimized “memchr()” is nearly 4x faster than the
> orginal implementation for long strings.
>
> We test the optimized “memchr()” in UML and also recompile the 5.18
> Kernel with the optimized “memchr()”. They run correctly.
>
> In this patch we add a new file "string_64.c", which only contains
> "memchr()". We can add more optimized string functions in it in the
> future.

Are there any workloads that care? From a quick grep I don't see any 
that look performance critical.

It would be good to describe what you optimized it for. For example,
optimization for small input strings is quite different from
optimization for large strings. I don't know which is more common in the kernel.

I assume you ran it through some existing test suites for memchr (like 
glibc etc.) for correctness testing?

(bugs in optimized string functions are often subtle, so it might also be
worth trying some randomized testing against a known-good reference)
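
For example, something along these lines (a rough sketch; test_memchr() is
just a stand-in for the implementation under test, and the libc memchr() is
the reference):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for the word-wise implementation under test */
static void *test_memchr(const void *p, int c, size_t length)
{
    const unsigned char *s = p, *end = s + length;
    unsigned long mask = (c & 0xff) * 0x0101010101010101ul, val;

    for (; s + 8 <= end; s += 8) {
        val = *(const unsigned long *)s ^ mask;
        if ((val + 0xfefefefefefefefful) & (~val & 0x8080808080808080ul))
            break;      /* this word contains the target byte */
    }
    for (; s < end; s++)
        if (*s == (c & 0xff))
            return (void *)s;
    return NULL;
}

int main(void)
{
    static unsigned char buf[4096];

    for (int i = 0; i < 100000; i++) {
        size_t len = rand() % sizeof(buf);
        int c = rand() % 256;

        for (size_t j = 0; j < len; j++)
            buf[j] = rand() % 256;
        /* randomized differential test against the reference memchr() */
        assert(test_memchr(buf, c, len) == memchr(buf, c, len));
    }
    return 0;
}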

-Andi



* Re: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-05-28 16:41   ` Tao Zhou
@ 2022-05-29 12:05     ` arthur chang arthur
  0 siblings, 0 replies; 10+ messages in thread
From: arthur chang arthur @ 2022-05-29 12:05 UTC (permalink / raw)
  To: Tao Zhou
  Cc: ak, jdike, tglx, mingo, bp, dave.hansen, x86, hpa, Kees Cook,
	linux-kernel, linux-hardening, richard, anton.ivanov, johannes,
	linux-um, jserv

On Sun, May 29, 2022 at 12:40 AM Tao Zhou <tao.zhou@linux.dev> wrote:
>
> On Sat, May 28, 2022 at 04:12:35PM +0800, Yu-Jen Chang wrote:
>
> > The original assembly version of memchr() is implemented with
> > the byte-wise comparing technique, which does not fully
> > use 64-bits registers in x86_64 CPU. We use word-wide
> > comparing so that 8 characters can be compared at the same time
> > on x86_64 CPU. First we align the input and then use word-wise
> > comparing to find the first 64-bit word that contain the target.
> > Secondly, we compare every byte in the word and get the output.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
> >
> > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > ---
> >  arch/x86/include/asm/string_64.h |  3 ++
> >  arch/x86/lib/Makefile            |  1 +
> >  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
> >  3 files changed, 82 insertions(+)
> >  create mode 100644 arch/x86/lib/string_64.c
> >
> > diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> > index 6e450827f..edce657e0 100644
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -14,6 +14,9 @@
> >  extern void *memcpy(void *to, const void *from, size_t len);
> >  extern void *__memcpy(void *to, const void *from, size_t len);
> >
> > +#define __HAVE_ARCH_MEMCHR
> > +extern void *memchr(const void *cs, int c, size_t length);
> > +
> >  #define __HAVE_ARCH_MEMSET
> >  void *memset(void *s, int c, size_t n);
> >  void *__memset(void *s, int c, size_t n);
> > diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> > index f76747862..4d530e559 100644
> > --- a/arch/x86/lib/Makefile
> > +++ b/arch/x86/lib/Makefile
> > @@ -69,5 +69,6 @@ else
> >          lib-y += clear_page_64.o copy_page_64.o
> >          lib-y += memmove_64.o memset_64.o
> >          lib-y += copy_user_64.o
> > +        lib-y += string_64.o
> >       lib-y += cmpxchg16b_emu.o
> >  endif
> > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> > new file mode 100644
> > index 000000000..4e067d5be
> > --- /dev/null
> > +++ b/arch/x86/lib/string_64.c
> > @@ -0,0 +1,78 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/string.h>
> > +#include <linux/export.h>
> > +#include <linux/align.h>
> > +
> > +/* How many bytes are loaded each iteration of the word copy loop */
> > +#define LBLOCKSIZE (sizeof(long))
> > +
> > +#ifdef __HAVE_ARCH_MEMCHR
> > +
> > +void *memchr(const void *cs, int c, size_t length)
> > +{
> > +     const unsigned char *src = (const unsigned char *)cs, d = c;
>
> I don't know why this 'd = c' is not error.
> d is a char pointer and c is int. At least this is not safe me do not know.
>

The variable d is a character value, not a pointer. Therefore 'd = c'
is not an error.

> > +     while (!IS_ALIGNED((long)src, sizeof(long))) {
> > +             if (!length--)
> > +                     return NULL;
> > +             if (*src == d)
>
> Compare a character value to a pointer value and this value is c.
> May be right do not know.
>
> Or:
>
> char d = c;
> ...
>
> > +                     return (void *)src;
> > +             src++;
> > +     }
> > +     if (length >= LBLOCKSIZE) {
> > +             unsigned long mask = d << 8 | d;
> > +             unsigned int i = 32;
> > +             long xor, data;
> > +             const long consta = 0xFEFEFEFEFEFEFEFF,
> > +                        constb = 0x8080808080808080;
>
> Two magic number..
>
> > +             /*
> > +              * Create a 8-bytes mask for word-wise comparing.
> > +              * For example, a mask for 'a' is 0x6161616161616161.
> > +              */
> > +
> > +             mask |= mask << 16;
> > +             for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> > +                     mask |= mask << i;
> > +             /*
> > +              * We perform word-wise comparing with following operation:
> > +              *      1. Perform xor on the long word @src and @mask
> > +              *         and put into @xor.
> > +              *      2. Add @xor with @consta.
> > +              *      3. ~@xor & @constb.
> > +              *      4. Perform & with the result of step 2 and 3.
> > +              *
> > +              * Step 1 creates a byte which is 0 in the long word if
> > +              * there is at least one target byte in it.
> > +              *
> > +              * Step 2 to Step 4 find if there is a byte with 0 in
> > +              * the long word.
> > +              */
> > +             asm volatile("1:\n\t"
> > +                          "movq (%0),%1\n\t"
> > +                          "xorq %6,%1\n\t"
> > +                          "lea (%1,%4), %2\n\t"
> > +                          "notq %1\n\t"
> > +                          "andq %5,%1\n\t"
> > +                          "testq %1,%2\n\t"
> > +                          "jne 2f\n\t"
>
> s/jne/jnz/
>
> Lack much here from me. But I give example that should check the
> CF flag is zero.
>
> 1) contain matching byte.
>
> 1111111011111111(consta)
>                         +add
> 0000000001101100(xor)
> ------------------------
> 1111111101101011 (%1)
>
>
> 1111111110010100(~xor)
>                         &and
> 1000000010000000(constb)
> ------------------------
> 1000000010000000 (%2)
>
>
> the logical and of %1 and %2 is
> 1000000000000000 that is not zero.
>
> 2) not contain matching byte
>
> 1111111011111111
>                  +
> 0110111011011100
> ----------------
> 0110110111011011(%1)
>
> 1001000100100011
>                  &
> 1000000010000000
> ----------------
> 1000000000000000(%2)
>
> %1 and %2 is
> 0000000000000000 that is zero.
>
> I guess that here should use jump instruction jnz instead.
> Even though, I do not know why that two magic number is so magical..
>
> Thanks,
> Tao

According to the Intel® 64 and IA-32 Architectures Developer's Manual, Vol. 2A,
pages 3-484 and 3-485, the jne and jnz instructions both jump according to the
zero flag: they jump when ZF = 0. I believe using jnz here makes the assembly
code easier to understand.

consta is the two's complement of 0x0101010101010101. That is, the addition
effectively performs a subtraction, so it is not necessary to check the CF flag.

Here is how these two magic numbers work.

After we xor the long word with the mask, a matching byte becomes 0x00.
Subtracting 0x0101010101010101 turns that zero byte into 0xFF. The following
step, ~xor & constb, turns the zero byte into 0x80. Then 0x80 & 0xFF = 0x80.

On the other hand, for a non-zero byte, let k (0 <= k <= 7) be the position
of the lowest 1 in this byte, so the lowest k bits are 0. After the addition,
the byte ends in a single 0 bit followed by k bits of value 1. To explain it
clearly, consider applying & (~xor) first: the byte turns into k bits of 1,
which is 2^k - 1. Then & constb on that result gives 0. (In the code we simply
do (~xor) & 0x8080808080808080 first.)
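
A small stand-alone sketch (illustrative only, not part of the patch) showing
the same zero-byte detection on a single 64-bit word:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* non-zero iff some byte of @word equals @c */
static int word_has_byte(uint64_t word, unsigned char c)
{
    uint64_t mask = c * 0x0101010101010101ULL; /* repeat c in every byte */
    uint64_t x = word ^ mask;                  /* a matching byte becomes 0x00 */

    /* (x - 0x0101..01) & ~x & 0x8080..80 is non-zero iff x has a zero byte */
    return ((x + 0xFEFEFEFEFEFEFEFFULL) & ~x & 0x8080808080808080ULL) != 0;
}

int main(void)
{
    uint64_t w;

    memcpy(&w, "hello123", 8);
    printf("%d %d\n", word_has_byte(w, 'l'), word_has_byte(w, 'z')); /* prints "1 0" */
    return 0;
}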

Thanks,
Yu-Jen Chang

> > +                          "add $8,%0\n\t"
> > +                          "sub $8,%3\n\t"
> > +                          "cmp $7,%3\n\t"
> > +                          "ja 1b\n\t"
> > +                          "2:\n\t"
> > +                          : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
> > +                          : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> > +                            "1"(xor), "2"(data), "3"(length)
> > +                          : "memory", "cc");
> > +     }
> > +
> > +     while (length--) {
> > +             if (*src == d)
> > +                     return (void *)src;
> > +             src++;
> > +     }
> > +     return NULL;
> > +}
> > +EXPORT_SYMBOL(memchr);
> > +#endif
> > --
> > 2.25.1
> >


* RE: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
  2022-05-28 16:41   ` Tao Zhou
@ 2022-05-30  8:09   ` David Laight
  2022-06-01  5:58     ` Yu-Jen Chang
  1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2022-05-30  8:09 UTC (permalink / raw)
  To: 'Yu-Jen Chang', ak, jdike
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, keescook, linux-kernel,
	linux-hardening, richard, anton.ivanov, johannes, linux-um,
	jserv

From: Yu-Jen Chang
> Sent: 28 May 2022 09:13
> 
> The original assembly version of memchr() is implemented with
> the byte-wise comparing technique, which does not fully
> use 64-bits registers in x86_64 CPU. We use word-wide
> comparing so that 8 characters can be compared at the same time
> on x86_64 CPU. First we align the input and then use word-wise
> comparing to find the first 64-bit word that contain the target.
> Secondly, we compare every byte in the word and get the output.
> 
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.
> 
> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> ---
>  arch/x86/include/asm/string_64.h |  3 ++
>  arch/x86/lib/Makefile            |  1 +
>  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+)
>  create mode 100644 arch/x86/lib/string_64.c
> 
...
> diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> new file mode 100644
> index 000000000..4e067d5be
> --- /dev/null
> +++ b/arch/x86/lib/string_64.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/string.h>
> +#include <linux/export.h>
> +#include <linux/align.h>
> +
> +/* How many bytes are loaded each iteration of the word copy loop */
> +#define LBLOCKSIZE (sizeof(long))
> +
> +#ifdef __HAVE_ARCH_MEMCHR
> +
> +void *memchr(const void *cs, int c, size_t length)
> +{
> +	const unsigned char *src = (const unsigned char *)cs, d = c;

You don't need the cast.

> +
> +	while (!IS_ALIGNED((long)src, sizeof(long))) {
> +		if (!length--)
> +			return NULL;
> +		if (*src == d)
> +			return (void *)src;
> +		src++;
> +	}

There is no point aligning the address.
On tests I've done misaligned reads don't even take an extra
clock - even if you get the cpu doing two reads/clock.
Even if they did the code isn't memory limited.

> +	if (length >= LBLOCKSIZE) {
> +		unsigned long mask = d << 8 | d;
> +		unsigned int i = 32;
> +		long xor, data;
> +		const long consta = 0xFEFEFEFEFEFEFEFF,
> +			   constb = 0x8080808080808080;
> +
> +		/*
> +		 * Create a 8-bytes mask for word-wise comparing.
> +		 * For example, a mask for 'a' is 0x6161616161616161.
> +		 */
> +
> +		mask |= mask << 16;
> +		for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> +			mask |= mask << i;

Given that consta/constb only support 64 bits, why the loop?
Just do mask |= mask << 32.
I'd also put all 3 calculations together - not hide one
in the initialiser.
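
I.e. something like (sketch only, the helper name is just for illustration):

static unsigned long repeat_byte(int c)
{
    unsigned long mask = (unsigned char)c;

    mask |= mask << 8;
    mask |= mask << 16;
    mask |= mask << 32;
    return mask;
}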

> +		/*
> +		 * We perform word-wise comparing with following operation:
> +		 *	1. Perform xor on the long word @src and @mask
> +		 *	   and put into @xor.
> +		 *	2. Add @xor with @consta.
> +		 *	3. ~@xor & @constb.
> +		 *	4. Perform & with the result of step 2 and 3.
> +		 *
> +		 * Step 1 creates a byte which is 0 in the long word if
> +		 * there is at least one target byte in it.
> +		 *
> +		 * Step 2 to Step 4 find if there is a byte with 0 in
> +		 * the long word.
> +		 */
> +		asm volatile("1:\n\t"
> +			     "movq (%0),%1\n\t"
> +			     "xorq %6,%1\n\t"
> +			     "lea (%1,%4), %2\n\t"
> +			     "notq %1\n\t"
> +			     "andq %5,%1\n\t"
> +			     "testq %1,%2\n\t"
> +			     "jne 2f\n\t"
> +			     "add $8,%0\n\t"
> +			     "sub $8,%3\n\t"
> +			     "cmp $7,%3\n\t"
> +			     "ja 1b\n\t"
> +			     "2:\n\t"
> +			     : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)

Why constrain src to %rdi?

> +			     : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> +			       "1"(xor), "2"(data), "3"(length)

Use "+r" in the outputs instead of respecifying the args.
I'd also suggest using named arguments - much easier to read.

> +			     : "memory", "cc");

Doesn't the compiler generate much the same code?
You should also be able to code without needing add, sub and cmp
at the end of the loop.
If you use negative offsets from the end of the buffer
the loop can be a single add and jnz.
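
For illustration, a rough sketch of that negative-offset form (untested; the
function name and details are only for exposition):

void *memchr_neg_off(const void *cs, int c, unsigned long length)
{
    const unsigned char *src = cs;
    const unsigned char *end = src + length;
    /* base marks the end of the word-wise region */
    const unsigned char *base = src + (length & ~7ul);
    long off = -(long)(length & ~7ul);
    unsigned long mask = (unsigned char)c * 0x0101010101010101ul;
    unsigned long val;

    /* loop control is one add plus one jnz per iteration */
    for (; off; off += 8) {
        val = *(const unsigned long *)(base + off) ^ mask;
        if ((val + 0xfefefefefefefefful) & (~val & 0x8080808080808080ul))
            break;
    }
    for (src = base + off; src < end; src++)
        if (*src == (unsigned char)c)
            return (void *)src;
    return NULL;
}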

	David

> +	}
> +
> +	while (length--) {
> +		if (*src == d)
> +			return (void *)src;
> +		src++;
> +	}
> +	return NULL;
> +}
> +EXPORT_SYMBOL(memchr);
> +#endif
> --
> 2.25.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-05-30  8:09   ` David Laight
@ 2022-06-01  5:58     ` Yu-Jen Chang
  2022-06-01  8:25       ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Yu-Jen Chang @ 2022-06-01  5:58 UTC (permalink / raw)
  To: David Laight
  Cc: ak, jdike, tglx, mingo, bp, dave.hansen, x86, hpa, keescook,
	linux-kernel, linux-hardening, richard, anton.ivanov, johannes,
	linux-um, jserv

David Laight <David.Laight@aculab.com> 於 2022年5月30日 週一 下午4:10寫道:
>
> From: Yu-Jen Chang
> > Sent: 28 May 2022 09:13
> >
> > The original assembly version of memchr() is implemented with
> > the byte-wise comparing technique, which does not fully
> > use 64-bits registers in x86_64 CPU. We use word-wide
> > comparing so that 8 characters can be compared at the same time
> > on x86_64 CPU. First we align the input and then use word-wise
> > comparing to find the first 64-bit word that contain the target.
> > Secondly, we compare every byte in the word and get the output.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
> >
> > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > ---
> >  arch/x86/include/asm/string_64.h |  3 ++
> >  arch/x86/lib/Makefile            |  1 +
> >  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
> >  3 files changed, 82 insertions(+)
> >  create mode 100644 arch/x86/lib/string_64.c
> >
> ...
> > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> > new file mode 100644
> > index 000000000..4e067d5be
> > --- /dev/null
> > +++ b/arch/x86/lib/string_64.c
> > @@ -0,0 +1,78 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/string.h>
> > +#include <linux/export.h>
> > +#include <linux/align.h>
> > +
> > +/* How many bytes are loaded each iteration of the word copy loop */
> > +#define LBLOCKSIZE (sizeof(long))
> > +
> > +#ifdef __HAVE_ARCH_MEMCHR
> > +
> > +void *memchr(const void *cs, int c, size_t length)
> > +{
> > +     const unsigned char *src = (const unsigned char *)cs, d = c;
>
> You don't need the cast.
>
> > +
> > +     while (!IS_ALIGNED((long)src, sizeof(long))) {
> > +             if (!length--)
> > +                     return NULL;
> > +             if (*src == d)
> > +                     return (void *)src;
> > +             src++;
> > +     }
>
> There is no point aligning the address.
> On tests I've done misaligned reads don't even take an extra
> clock - even if you get the cpu doing two reads/clock.
> Even if they did the code isn't memory limited.
>
> > +     if (length >= LBLOCKSIZE) {
> > +             unsigned long mask = d << 8 | d;
> > +             unsigned int i = 32;
> > +             long xor, data;
> > +             const long consta = 0xFEFEFEFEFEFEFEFF,
> > +                        constb = 0x8080808080808080;
> > +
> > +             /*
> > +              * Create a 8-bytes mask for word-wise comparing.
> > +              * For example, a mask for 'a' is 0x6161616161616161.
> > +              */
> > +
> > +             mask |= mask << 16;
> > +             for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> > +                     mask |= mask << i;
>
> Given that consta/b only support 64 bit why the loop.
> Just do mask |= mask << 32.
> I'd also put all 3 calculations together - not hide one
> in the initialiser.
>
> > +             /*
> > +              * We perform word-wise comparing with following operation:
> > +              *      1. Perform xor on the long word @src and @mask
> > +              *         and put into @xor.
> > +              *      2. Add @xor with @consta.
> > +              *      3. ~@xor & @constb.
> > +              *      4. Perform & with the result of step 2 and 3.
> > +              *
> > +              * Step 1 creates a byte which is 0 in the long word if
> > +              * there is at least one target byte in it.
> > +              *
> > +              * Step 2 to Step 4 find if there is a byte with 0 in
> > +              * the long word.
> > +              */
> > +             asm volatile("1:\n\t"
> > +                          "movq (%0),%1\n\t"
> > +                          "xorq %6,%1\n\t"
> > +                          "lea (%1,%4), %2\n\t"
> > +                          "notq %1\n\t"
> > +                          "andq %5,%1\n\t"
> > +                          "testq %1,%2\n\t"
> > +                          "jne 2f\n\t"
> > +                          "add $8,%0\n\t"
> > +                          "sub $8,%3\n\t"
> > +                          "cmp $7,%3\n\t"
> > +                          "ja 1b\n\t"
> > +                          "2:\n\t"
> > +                          : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
>
> Why constrain src to %rdi?

At first I tried to use some instructions that require %rdi, but I
realized I would not use them after all. It is unnecessary to constrain
src to %rdi.

>
> > +                          : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> > +                            "1"(xor), "2"(data), "3"(length)
>
> Use "+r" in the outputs instead of respecifying the args.
> I'd also suggest using named arguments - much easier to read.
>
> > +                          : "memory", "cc");
>
> Doesn't the compiler generate much the same code?
> You should also be able to code without needing add, sub and cmp
> at the end of the loop.
> If you use negative offsets from the end of the buffer
> the loop can be a single add and jnz.
>
>         David
>
> > +     }
> > +
> > +     while (length--) {
> > +             if (*src == d)
> > +                     return (void *)src;
> > +             src++;
> > +     }
> > +     return NULL;
> > +}
> > +EXPORT_SYMBOL(memchr);
> > +#endif
> > --
> > 2.25.1
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

I removed the address-aligning part. In my tests the performance is similar.
Here I rewrite the assembly using named arguments and reduce one instruction
in the loop by adding two variables, 'end' and 'dst'. 'end' stores the address
of the end of the string. 'dst' stores the address at which the word-wise
comparison stops. As a result, when 'src' is equal to 'dst', fewer than 8
characters remain, and the following while loop checks whether the target
character is among them.

In my tests the performance is similar to my original implementation. It is
only slightly faster when going through a very long string, which contains
128*1024 characters with the target character near the end of the string.

I also explain more clearly in the comments how consta and constb work. I hope
that helps understanding.

The following code is what I changed.

void *memchr(const void *cs, int c, size_t length)
{
     const unsigned char *src = (const unsigned char *)cs;
     const unsigned char *end = src + length;

     if (length >= LBLOCKSIZE) {
             unsigned long mask = c << 8 | c;
             long xor, data;
             const long consta = 0xFEFEFEFEFEFEFEFF,
                        constb = 0x8080808080808080;
             const unsigned char *dst = (const unsigned char *)src +
                                               (length & 0xFFFFFFFFFFFFFFF8);

             /*
              * Create a 8-bytes mask for word-wise comparing.
              * For example, a mask for 'a' is 0x6161616161616161.
              */

             mask |= mask << 16;
             mask |= mask << 32;
             /*
              * We perform word-wise comparing with following operation:
              * 1. Perform xor on the long word @src and @mask
              *    and put into @xor.
              * 2. Add @xor with @consta.
              * 3. ~@xor & @constb.
              * 4. Perform & with the result of step 2 and 3.
              *
              * If there is a zero byte in @xor, step 2 turns it into
              * 0xFF. Then step 3 and 4 turn it into 0x80.
              *
              * If there is a none-zero byte in @xor, let k
              * (0 <= k <= 7) be the lowest 1 in this byte. The lowest
              * k bits are 0. After step 2, the byte ends in a single
              * bit of value 0. Step 3 and 4 turns this byte into k
              * bits of 1, which is 2^k - 1, at first. Then & @constb
              * makes it into 0.
              *
              * Step 2 to Step 4 find if there is a byte with 0 in
              * the long word.
              */
              asm volatile("1:\n\t"
                            "movq (%[src]),%[xor]\n\t"
                            "xorq %[mask],%[xor]\n\t"
                            "lea (%[xor],%[const_a]), %[tmp]\n\t"
                            "notq %[xor]\n\t"
                            "andq %[const_b],%[xor]\n\t"
                            "testq %[xor],%[tmp]\n\t"
                            "jnz 2f\n\t"
                            "add $8,%[src]\n\t"
                            "cmp %[src], %[dst]\n\t"
                            "ja 1b\n\t"
                            "2:\n\t"
                            :
                            [src] "+r"(src), [xor] "+r"(xor), [tmp] "+r"(data)
                            : [const_a] "r"(consta), [const_b] "r"(constb),
                              [mask] "r"(mask), [dst] "r"(dst)
                            : "memory", "cc");
        }

        while (src <= end) {
             if (*src == d)
                     return (void *)src;
             src++;
        }
        return NULL;
}

Thanks,
Yu-Jen Chang


* RE: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-06-01  5:58     ` Yu-Jen Chang
@ 2022-06-01  8:25       ` David Laight
  2022-06-06  3:25         ` Yu-Jen Chang
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2022-06-01  8:25 UTC (permalink / raw)
  To: 'Yu-Jen Chang'
  Cc: ak, jdike, tglx, mingo, bp, dave.hansen, x86, hpa, keescook,
	linux-kernel, linux-hardening, richard, anton.ivanov, johannes,
	linux-um, jserv

From: Yu-Jen Chang
> Sent: 01 June 2022 06:59
> 
> David Laight <David.Laight@aculab.com> 於 2022年5月30日 週一 下午4:10寫道:
> >
> > From: Yu-Jen Chang
> > > Sent: 28 May 2022 09:13
> > >
> > > The original assembly version of memchr() is implemented with
> > > the byte-wise comparing technique, which does not fully
> > > use 64-bits registers in x86_64 CPU. We use word-wide
> > > comparing so that 8 characters can be compared at the same time
> > > on x86_64 CPU. First we align the input and then use word-wise
> > > comparing to find the first 64-bit word that contain the target.
> > > Secondly, we compare every byte in the word and get the output.
> > >
> > > We create two files to measure the performance. The first file
> > > contains on average 10 characters ahead the target character.
> > > The second file contains at least 1000 characters ahead the
> > > target character. Our implementation of “memchr()” is slightly
> > > better in the first test and nearly 4x faster than the orginal
> > > implementation in the second test.
> > >
> > > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > > ---
> > >  arch/x86/include/asm/string_64.h |  3 ++
> > >  arch/x86/lib/Makefile            |  1 +
> > >  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
> > >  3 files changed, 82 insertions(+)
> > >  create mode 100644 arch/x86/lib/string_64.c
> > >
> > ...
> > > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> > > new file mode 100644
> > > index 000000000..4e067d5be
> > > --- /dev/null
> > > +++ b/arch/x86/lib/string_64.c
> > > @@ -0,0 +1,78 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/string.h>
> > > +#include <linux/export.h>
> > > +#include <linux/align.h>
> > > +
> > > +/* How many bytes are loaded each iteration of the word copy loop */
> > > +#define LBLOCKSIZE (sizeof(long))
> > > +
> > > +#ifdef __HAVE_ARCH_MEMCHR
> > > +
> > > +void *memchr(const void *cs, int c, size_t length)
> > > +{
> > > +     const unsigned char *src = (const unsigned char *)cs, d = c;
> >
> > You don't need the cast.
> >
> > > +
> > > +     while (!IS_ALIGNED((long)src, sizeof(long))) {
> > > +             if (!length--)
> > > +                     return NULL;
> > > +             if (*src == d)
> > > +                     return (void *)src;
> > > +             src++;
> > > +     }
> >
> > There is no point aligning the address.
> > On tests I've done misaligned reads don't even take an extra
> > clock - even if you get the cpu doing two reads/clock.
> > Even if they did the code isn't memory limited.
> >
> > > +     if (length >= LBLOCKSIZE) {
> > > +             unsigned long mask = d << 8 | d;
> > > +             unsigned int i = 32;
> > > +             long xor, data;
> > > +             const long consta = 0xFEFEFEFEFEFEFEFF,
> > > +                        constb = 0x8080808080808080;
> > > +
> > > +             /*
> > > +              * Create a 8-bytes mask for word-wise comparing.
> > > +              * For example, a mask for 'a' is 0x6161616161616161.
> > > +              */
> > > +
> > > +             mask |= mask << 16;
> > > +             for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> > > +                     mask |= mask << i;
> >
> > Given that consta/b only support 64 bit why the loop.
> > Just do mask |= mask << 32.
> > I'd also put all 3 calculations together - not hide one
> > in the initialiser.
> >
> > > +             /*
> > > +              * We perform word-wise comparing with following operation:
> > > +              *      1. Perform xor on the long word @src and @mask
> > > +              *         and put into @xor.
> > > +              *      2. Add @xor with @consta.
> > > +              *      3. ~@xor & @constb.
> > > +              *      4. Perform & with the result of step 2 and 3.
> > > +              *
> > > +              * Step 1 creates a byte which is 0 in the long word if
> > > +              * there is at least one target byte in it.
> > > +              *
> > > +              * Step 2 to Step 4 find if there is a byte with 0 in
> > > +              * the long word.
> > > +              */
> > > +             asm volatile("1:\n\t"
> > > +                          "movq (%0),%1\n\t"
> > > +                          "xorq %6,%1\n\t"
> > > +                          "lea (%1,%4), %2\n\t"
> > > +                          "notq %1\n\t"
> > > +                          "andq %5,%1\n\t"
> > > +                          "testq %1,%2\n\t"
> > > +                          "jne 2f\n\t"
> > > +                          "add $8,%0\n\t"
> > > +                          "sub $8,%3\n\t"
> > > +                          "cmp $7,%3\n\t"
> > > +                          "ja 1b\n\t"
> > > +                          "2:\n\t"
> > > +                          : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
> >
> > Why constrain src to %rdi?
> 
> At first I try to use some instructions related to %rdi, but I realize
> that I won't use these instructions. It is unnecessary to constrain
> src to %rdi.
> 
> >
> > > +                          : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> > > +                            "1"(xor), "2"(data), "3"(length)
> >
> > Use "+r" in the outputs instead of respecifying the args.
> > I'd also suggest using named arguments - much easier to read.
> >
> > > +                          : "memory", "cc");
> >
> > Doesn't the compiler generate much the same code?
> > You should also be able to code without needing add, sub and cmp
> > at the end of the loop.
> > If you use negative offsets from the end of the buffer
> > the loop can be a single add and jnz.
> >
> >         David
> >
> > > +     }
> > > +
> > > +     while (length--) {
> > > +             if (*src == d)
> > > +                     return (void *)src;
> > > +             src++;
> > > +     }
> > > +     return NULL;
> > > +}
> > > +EXPORT_SYMBOL(memchr);
> > > +#endif
> > > --
> > > 2.25.1
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
> 
> I remove the aligning address part. On my tests the performance are similar.
> Here I rewrite the assembly using named arguments and I reduce one instruction
> in the loop by adding two parameters, which are  'end' and 'dst'.
> 'end' stores the
> address of the end of the string. 'dst' stores the address of the end
> of word-wise
> comparison. As a result, when 'src' is equal to 'dst', the number of remaining
> characters is less than 8. The following while loop will find if the
> target character is
> in these remaining characters.
> 
> On my test the performance is similar with the my original implementation. Only
> a little bit fast when going through a very long string, which contains 128*1024
> characters and the target character is near the end of the string.
> 
> I also explain how consta and constb work clearly in the comments. Hope that it
> helps understanding.
> 
> The following code is what I change.
> 
> void *memchr(const void *cs, int c, size_t length)
> {
>      const unsigned char *src = (const unsigned char *)cs;
>      const unsigned char *end = src + length;
> 
>      if (length >= LBLOCKSIZE) {
>              unsigned long mask = c << 8 | c;

That is wrong if 'c' is outside 0..255.
I suspect it is best to at least allow -128..-1.

>              long xor, data;
>              const long consta = 0xFEFEFEFEFEFEFEFF,
>                         constb = 0x8080808080808080;
>              const unsigned char *dst = (const unsigned char *)src +
>                                                (length & 0xFFFFFFFFFFFFFFF8);
> 
>              /*
>               * Create a 8-bytes mask for word-wise comparing.
>               * For example, a mask for 'a' is 0x6161616161616161.
>               */
> 
>              mask |= mask << 16;
>              mask |= mask << 32;
>              /*
>               * We perform word-wise comparing with following operation:
>               * 1. Perform xor on the long word @src and @mask
>               *    and put into @xor.
>               * 2. Add @xor with @consta.
>               * 3. ~@xor & @constb.
>               * 4. Perform & with the result of step 2 and 3.
>               *
>               * If there is a zero byte in @xor, step 2 turns it into
>               * 0xFF. Then step 3 and 4 turn it into 0x80.
>               *
>               * If there is a none-zero byte in @xor, let k
>               * (0 <= k <= 7) be the lowest 1 in this byte. The lowest
>               * k bits are 0. After step 2, the byte ends in a single
>               * bit of value 0. Step 3 and 4 turns this byte into k
>               * bits of 1, which is 2^k - 1, at first. Then & @constb
>               * makes it into 0.
>               *
>               * Step 2 to Step 4 find if there is a byte with 0 in
>               * the long word.
>               */
>               asm volatile("1:\n\t"
>                             "movq (%[src]),%[xor]\n\t"
>                             "xorq %[mask],%[xor]\n\t"
>                             "lea (%[xor],%[const_a]), %[tmp]\n\t"
>                             "notq %[xor]\n\t"
>                             "andq %[const_b],%[xor]\n\t"
>                             "testq %[xor],%[tmp]\n\t"
>                             "jnz 2f\n\t"
>                             "add $8,%[src]\n\t"
>                             "cmp %[src], %[dst]\n\t"
>                             "ja 1b\n\t"
>                             "2:\n\t"
>                             :
>                             [src] "+r"(src), [xor] "+r"(xor), [tmp] "+r"(data)
>                             : [const_a] "r"(consta), [const_b] "r"(constb),
>                               [mask] "r"(mask), [dst] "r"(dst)
>                             : "memory", "cc");
>         }
> 
>         while (src <= end) {
>              if (*src == d)

I think you mean 'c'.

>                      return (void *)src;
>              src++;
>         }
>         return NULL;
> }
> 
> Thanks,
> Yu-Jen Chang

Gcc compiles this C to the same loop and it is easier to read.
Valid on all little-endian 64-bit systems.

void *memchr(const void *p, int c, unsigned long length)
{
    unsigned long mask, val;
    const void *end = p + length;

    c &= 0xff;
    if (p <= end - 8) {
        mask = c | c << 8;
        mask |= mask << 16;
        mask |= mask << 32;

        for (; p <= end - 8; p += 8) {
            val = *(unsigned long *)p ^ mask;
            if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u))
                break;
        }
    }

    for (; p < end; p++)
        if (*(unsigned char *)p == c)
            return p;

    return NULL;
}

See https://godbolt.org/z/6rqTqfEsx

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 1/2] x86/lib: Optimize memchr()
  2022-06-01  8:25       ` David Laight
@ 2022-06-06  3:25         ` Yu-Jen Chang
  0 siblings, 0 replies; 10+ messages in thread
From: Yu-Jen Chang @ 2022-06-06  3:25 UTC (permalink / raw)
  To: David Laight
  Cc: ak, jdike, tglx, mingo, bp, dave.hansen, x86, hpa, keescook,
	linux-kernel, linux-hardening, richard, anton.ivanov, johannes,
	linux-um, jserv

David Laight <David.Laight@aculab.com> 於 2022年6月1日 週三 下午4:25寫道:
>
> From: Yu-Jen Chang
> > Sent: 01 June 2022 06:59
> >
> > David Laight <David.Laight@aculab.com> 於 2022年5月30日 週一 下午4:10寫道:
> > >
> > > From: Yu-Jen Chang
> > > > Sent: 28 May 2022 09:13
> > > >
> > > > The original assembly version of memchr() is implemented with
> > > > the byte-wise comparing technique, which does not fully
> > > > use 64-bits registers in x86_64 CPU. We use word-wide
> > > > comparing so that 8 characters can be compared at the same time
> > > > on x86_64 CPU. First we align the input and then use word-wise
> > > > comparing to find the first 64-bit word that contain the target.
> > > > Secondly, we compare every byte in the word and get the output.
> > > >
> > > > We create two files to measure the performance. The first file
> > > > contains on average 10 characters ahead the target character.
> > > > The second file contains at least 1000 characters ahead the
> > > > target character. Our implementation of “memchr()” is slightly
> > > > better in the first test and nearly 4x faster than the orginal
> > > > implementation in the second test.
> > > >
> > > > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > > > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > > > ---
> > > >  arch/x86/include/asm/string_64.h |  3 ++
> > > >  arch/x86/lib/Makefile            |  1 +
> > > >  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
> > > >  3 files changed, 82 insertions(+)
> > > >  create mode 100644 arch/x86/lib/string_64.c
> > > >
> > > ...
> > > > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> > > > new file mode 100644
> > > > index 000000000..4e067d5be
> > > > --- /dev/null
> > > > +++ b/arch/x86/lib/string_64.c
> > > > @@ -0,0 +1,78 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +#include <linux/string.h>
> > > > +#include <linux/export.h>
> > > > +#include <linux/align.h>
> > > > +
> > > > +/* How many bytes are loaded each iteration of the word copy loop */
> > > > +#define LBLOCKSIZE (sizeof(long))
> > > > +
> > > > +#ifdef __HAVE_ARCH_MEMCHR
> > > > +
> > > > +void *memchr(const void *cs, int c, size_t length)
> > > > +{
> > > > +     const unsigned char *src = (const unsigned char *)cs, d = c;
> > >
> > > You don't need the cast.
> > >
> > > > +
> > > > +     while (!IS_ALIGNED((long)src, sizeof(long))) {
> > > > +             if (!length--)
> > > > +                     return NULL;
> > > > +             if (*src == d)
> > > > +                     return (void *)src;
> > > > +             src++;
> > > > +     }
> > >
> > > There is no point aligning the address.
> > > On tests I've done misaligned reads don't even take an extra
> > > clock - even if you get the cpu doing two reads/clock.
> > > Even if they did the code isn't memory limited.
> > >
> > > > +     if (length >= LBLOCKSIZE) {
> > > > +             unsigned long mask = d << 8 | d;
> > > > +             unsigned int i = 32;
> > > > +             long xor, data;
> > > > +             const long consta = 0xFEFEFEFEFEFEFEFF,
> > > > +                        constb = 0x8080808080808080;
> > > > +
> > > > +             /*
> > > > +              * Create a 8-bytes mask for word-wise comparing.
> > > > +              * For example, a mask for 'a' is 0x6161616161616161.
> > > > +              */
> > > > +
> > > > +             mask |= mask << 16;
> > > > +             for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> > > > +                     mask |= mask << i;
> > >
> > > Given that consta/b only support 64 bit why the loop.
> > > Just do mask |= mask << 32.
> > > I'd also put all 3 calculations together - not hide one
> > > in the initialiser.
> > >
> > > > +             /*
> > > > +              * We perform word-wise comparing with following operation:
> > > > +              *      1. Perform xor on the long word @src and @mask
> > > > +              *         and put into @xor.
> > > > +              *      2. Add @xor with @consta.
> > > > +              *      3. ~@xor & @constb.
> > > > +              *      4. Perform & with the result of step 2 and 3.
> > > > +              *
> > > > +              * Step 1 creates a byte which is 0 in the long word if
> > > > +              * there is at least one target byte in it.
> > > > +              *
> > > > +              * Step 2 to Step 4 find if there is a byte with 0 in
> > > > +              * the long word.
> > > > +              */
> > > > +             asm volatile("1:\n\t"
> > > > +                          "movq (%0),%1\n\t"
> > > > +                          "xorq %6,%1\n\t"
> > > > +                          "lea (%1,%4), %2\n\t"
> > > > +                          "notq %1\n\t"
> > > > +                          "andq %5,%1\n\t"
> > > > +                          "testq %1,%2\n\t"
> > > > +                          "jne 2f\n\t"
> > > > +                          "add $8,%0\n\t"
> > > > +                          "sub $8,%3\n\t"
> > > > +                          "cmp $7,%3\n\t"
> > > > +                          "ja 1b\n\t"
> > > > +                          "2:\n\t"
> > > > +                          : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
> > >
> > > Why constrain src to %rdi?
> >
> > At first I tried to use some instructions that require %rdi, but I
> > realized that I don't use them, so it is unnecessary to constrain
> > src to %rdi.
> >
> > >
> > > > +                          : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> > > > +                            "1"(xor), "2"(data), "3"(length)
> > >
> > > Use "+r" in the outputs instead of respecifying the args.
> > > I'd also suggest using named arguments - much easier to read.
> > >
> > > > +                          : "memory", "cc");
> > >
> > > Doesn't the compiler generate much the same code?
> > > You should also be able to code without needing add, sub and cmp
> > > at the end of the loop.
> > > If you use negative offsets from the end of the buffer
> > > the loop can be a single add and jnz.
> > >
> > >         David
> > >
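
A minimal C sketch of the loop shape David suggests above, for reference only
(it is not part of the patch and reuses 'src', 'length', 'mask' and LBLOCKSIZE
from the code being reviewed): the byte offset runs from a negative value up
to zero, so the compiled loop can end in a single add plus a conditional jump.

    /* sketch only: scan whole words using a negative offset from the end */
    const unsigned char *blocks_end = src + (length & ~(LBLOCKSIZE - 1));
    long off = -(long)(length & ~(LBLOCKSIZE - 1));

    while (off) {
        unsigned long val = *(const unsigned long *)(blocks_end + off) ^ mask;

        if ((val + 0xfefefefefefefeffUL) & (~val & 0x8080808080808080UL))
            break;              /* this word contains the target byte */
        off += LBLOCKSIZE;      /* one add; the loop branch tests the result */
    }
    /* blocks_end + off now points at the matching word, or at blocks_end */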
> > > > +     }
> > > > +
> > > > +     while (length--) {
> > > > +             if (*src == d)
> > > > +                     return (void *)src;
> > > > +             src++;
> > > > +     }
> > > > +     return NULL;
> > > > +}
> > > > +EXPORT_SYMBOL(memchr);
> > > > +#endif
> > > > --
> > > > 2.25.1
> > >
> >
> > I have removed the address-aligning part. In my tests the performance is
> > similar. Here I rewrite the assembly using named arguments and remove one
> > instruction from the loop by adding two variables, 'end' and 'dst'.
> > 'end' holds the address of the end of the string, and 'dst' holds the
> > address where the word-wise comparison stops. As a result, when 'src'
> > reaches 'dst', fewer than 8 characters remain, and the following while
> > loop checks whether the target character is among them.
> >
> > In my tests the performance is similar to my original implementation;
> > it is only slightly faster on a very long string of 128*1024 characters
> > with the target character near the end.
> >
> > I also explain in the comments how consta and constb work. I hope that
> > helps understanding.
> >
> > The following code is what I changed.
> >
> > void *memchr(const void *cs, int c, size_t length)
> > {
> >      const unsigned char *src = (const unsigned char *)cs;
> >      const unsigned char *end = src + length;
> >
> >      if (length >= LBLOCKSIZE) {
> >              unsigned long mask = c << 8 | c;
>
> That is wrong if 'c' is outside 0..255.
> I suspect it is best to at least allow -128..-1.
>
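
A small sketch (not the patch code) of the point above: memchr() compares
against (unsigned char)c, so the value should be reduced to a byte before the
mask is built, for example:

    /* illustration only: make negative or out-of-range c behave like the
     * byte memchr() actually searches for
     */
    unsigned char target = (unsigned char)c;   /* c == -1 becomes 0xff */
    unsigned long mask = target;

    mask |= mask << 8;
    mask |= mask << 16;
    mask |= mask << 32;    /* every byte of the mask now holds the target */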
> >              long xor, data;
> >              const long consta = 0xFEFEFEFEFEFEFEFF,
> >                         constb = 0x8080808080808080;
> >              const unsigned char *dst = (const unsigned char *)src +
> >                                                (length & 0xFFFFFFFFFFFFFFF8);
> >
> >              /*
> >               * Create an 8-byte mask for word-wise comparison.
> >               * For example, the mask for 'a' is 0x6161616161616161.
> >               */
> >
> >              mask |= mask << 16;
> >              mask |= mask << 32;
> >              /*
> >               * We perform word-wise comparison with the following operations:
> >               * 1. Perform xor on the long word @src and @mask
> >               *    and put into @xor.
> >               * 2. Add @xor with @consta.
> >               * 3. ~@xor & @constb.
> >               * 4. Perform & with the result of step 2 and 3.
> >               *
> >               * If there is a zero byte in @xor, step 2 turns it into
> >               * 0xFF. Then step 3 and 4 turn it into 0x80.
> >               *
> >               * If there is a non-zero byte in @xor, let k
> >               * (0 <= k <= 7) be the position of its lowest set bit;
> >               * the k bits below it are 0. Step 2 clears bit k and
> >               * sets those k low bits, so ANDing the step 2 result
> >               * with ~@xor gives 2^k - 1 for this byte, and the &
> >               * with @constb then makes it 0.
> >               *
> >               * Step 2 to Step 4 find if there is a byte with 0 in
> >               * the long word.
> >               */
> >               asm volatile("1:\n\t"
> >                             "movq (%[src]),%[xor]\n\t"
> >                             "xorq %[mask],%[xor]\n\t"
> >                             "lea (%[xor],%[const_a]), %[tmp]\n\t"
> >                             "notq %[xor]\n\t"
> >                             "andq %[const_b],%[xor]\n\t"
> >                             "testq %[xor],%[tmp]\n\t"
> >                             "jnz 2f\n\t"
> >                             "add $8,%[src]\n\t"
> >                             "cmp %[src], %[dst]\n\t"
> >                             "ja 1b\n\t"
> >                             "2:\n\t"
> >                             :
> >                             [src] "+r"(src), [xor] "+r"(xor), [tmp] "+r"(data)
> >                             : [const_a] "r"(consta), [const_b] "r"(constb),
> >                               [mask] "r"(mask), [dst] "r"(dst)
> >                             : "memory", "cc");
> >         }
> >
> >         while (src <= end) {
> >              if (*src == d)
>
> I think you mean 'c'.
>
> >                      return (void *)src;
> >              src++;
> >         }
> >         return NULL;
> > }
> >
> > Thanks,
> > Yu-Jen Chang
>
> Gcc compiles this C to the same loop, and it is easier to read.
> Valid on all little-endian 64-bit systems.
>
> void *memchr(const void *p, int c, unsigned long length)
> {
>     unsigned long mask, val;
>     const void *end = p + length;
>
>     c &= 0xff;
>     if (p <= end - 8) {
>         mask = c | c << 8;
>         mask |= mask << 16;
>         mask |= mask << 32;
>
>         for (; p <= end - 8; p += 8) {
>             val = *(unsigned long *)p ^ mask;
>             if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u))
>                 break;
>         }
>     }
>
>     for (; p < end; p++)
>         if (*(unsigned char *)p == c)
>             return p;

Here I think we should return (void *)p, so that there is no
compilation warning.

>
>     return NULL;
> }
>
> See https://godbolt.org/z/6rqTqfEsx
>
>         David
>

Your modification is easier to understand. I have added comments to
explain how it works, and I will use it in the second version of the
patch. Thanks for your advice.

Yu-Jen Chang

void *memchr(const void *p, int c, unsigned long length)
{
    unsigned long mask, val;
    const void *end = p + length;

    c &= 0xff;
    if (p <= end - 8) {


        /*
         * Create an 8-byte mask for word-wise comparison.
         * For example, the mask for 'a' is 0x6161616161616161.
         */

        mask = c | c << 8;
        mask |= mask << 16;
        mask |= mask << 32;

        /*
         * We perform word-wise comparison with the following steps:
         * 1. XOR the long word at @p with @mask and store the
         *    result in @val.  A byte of @val is zero exactly
         *    where the original byte matches the target.
         * 2. Add 0xfefefefefefefeff to @val.  This is the same
         *    as subtracting 0x0101010101010101, i.e. subtracting
         *    1 from every byte (borrows between bytes can only be
         *    generated by zero bytes, so they never change whether
         *    the final result is zero or non-zero).
         * 3. Compute ~@val & 0x8080808080808080.
         * 4. AND the results of step 2 and step 3.
         *
         * If a byte of @val is zero, step 2 turns it into 0xff
         * (or 0xfe when a borrow comes in from below); either way
         * its top bit ends up set, step 3 also keeps 0x80 for that
         * byte, and step 4 leaves 0x80: the result is non-zero.
         *
         * If a byte of @val is non-zero, either its top bit is
         * set, and then step 3 clears it, or its top bit is clear,
         * and then subtracting 1 in step 2 cannot set it.  Either
         * way step 4 leaves 0 in that byte.
         *
         * The result of step 4 is therefore non-zero exactly when
         * the long word contains the target byte.
         */

        for (; p <= end - 8; p += 8) {
            val = *(unsigned long *)p ^ mask;
            if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u))
                break;
        }
    }

    for (; p < end; p++)
        if (*(unsigned char *)p == c)
            return (void *)p;

    return NULL;
}
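
A worked trace of the detection step (assuming little-endian x86-64 and a
target of 'a' = 0x61, so mask = 0x6161616161616161) may make the comment
above more concrete:

  word = "aaaaaaaa":  val = word ^ mask = 0x0000000000000000
    val + 0xfefefefefefefeff  = 0xfefefefefefefeff
    ~val & 0x8080808080808080 = 0x8080808080808080
    AND of the two            = 0x8080808080808080  -> non-zero, match found

  word = "bbbbbbbb":  val = word ^ mask = 0x0303030303030303
    val + 0xfefefefefefefeff  = 0x0202020202020202  (carries run through
                                                      every byte)
    ~val & 0x8080808080808080 = 0x8080808080808080
    AND of the two            = 0                   -> no match, keep going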

^ permalink raw reply	[flat|nested] 10+ messages in thread

