linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] Optimize memchr()
@ 2022-07-10 14:28 Yu-Jen Chang
  2022-07-10 14:28 ` [PATCH 1/2] lib/string.c: Add a macro for memchr_inv() Yu-Jen Chang
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-10 14:28 UTC (permalink / raw)
  To: andy, akinobu.mita; +Cc: jserv, linux-kernel, Yu-Jen Chang

This patch series optimizes "memchr()" and adds a macro for
"memchr_inv()" so that both functions can use it to generate a bit mask.

The original implementation of "memchr()" is based on byte-wise comparison,
which does not fully use the 64-bit or 32-bit registers of the CPU. We
implement a word-wise comparison so that at least 4 bytes can be compared
at the same time. The optimized "memchr()" is nearly 4x faster than the
original one for long strings. In the Linux kernel, we found that the
length of the string searched by "memchr()" is up to 512 bytes in
drivers/misc/lkdtm/heap.c. In our test, the optimized version is about
20% faster if the target character is at the end of a 512-byte string.
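
For illustration only (not part of the series), a minimal user-space
sketch of the word-wise comparison described above. The names here are
ours; the actual patch uses a MEMCHR_MASK_GEN() macro and kernel types,
but the idea is the same: repeat the byte across a word, XOR, and test
the word for a zero byte.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *memchr_wordwise(const void *s, int c, size_t n)
{
	const unsigned char *p = s, *end = p + n;
	uint64_t mask = (unsigned char)c;

	mask |= mask << 8;
	mask |= mask << 16;
	mask |= mask << 32;		/* repeat the byte in all 8 lanes */

	for (; end - p >= 8; p += 8) {
		uint64_t val;

		memcpy(&val, p, 8);	/* plain load, no unaligned deref */
		val ^= mask;		/* a matching byte becomes 0x00 */
		if ((val - 0x0101010101010101ULL) & ~val &
		    0x8080808080808080ULL)
			break;		/* some byte in this word matched */
	}
	for (; p < end; p++)		/* tail, and the exact hit in the word */
		if (*p == (unsigned char)c)
			return (void *)p;
	return NULL;
}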

We recompiled the 5.18 kernel with the optimized "memchr()" for both
32-bit and 64-bit targets. Both kernels run correctly.

Yu-Jen Chang (2):
  lib/string.c: Add a macro for memchr_inv()
  lib/string.c: Optimize memchr()

 lib/string.c | 62 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 45 insertions(+), 17 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] lib/string.c: Add a macro for memchr_inv()
  2022-07-10 14:28 [PATCH 0/2] Optimize memchr() Yu-Jen Chang
@ 2022-07-10 14:28 ` Yu-Jen Chang
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
  2022-08-12 19:06 ` [PATCH 0/2] " Pavel Machek
  2 siblings, 0 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-10 14:28 UTC (permalink / raw)
  To: andy, akinobu.mita; +Cc: jserv, linux-kernel, Yu-Jen Chang

We add a macro MEMCHR_MASK_GEN() so that both memchr_inv()
and memchr() can use it to generate an 8-byte mask.
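
As an illustration (our worked example, not part of the patch), the
shift-and-or fallback expands a byte value such as c = 0x5a like this:

	mask = 0x000000000000005a;
	mask |= mask << 8;	/* 0x0000000000005a5a */
	mask |= mask << 16;	/* 0x000000005a5a5a5a */
	mask |= mask << 32;	/* 0x5a5a5a5a5a5a5a5a */

On CONFIG_ARCH_HAS_FAST_MULTIPLIER targets the patch produces the same
mask with a multiply instead of the shifts.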

Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
---
 lib/string.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 485777c9d..80469e6c3 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -879,6 +879,29 @@ char *strnstr(const char *s1, const char *s2, size_t len)
 EXPORT_SYMBOL(strnstr);
 #endif
 
+#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64
+
+#define MEMCHR_MASK_GEN(mask) (mask *= 0x0101010101010101ULL)
+
+#elif defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER)
+
+#define MEMCHR_MASK_GEN(mask)                                                  \
+	do {                                                                   \
+		mask *= 0x01010101;                                            \
+		mask |= mask << 32;                                            \
+	} while (0)
+
+#else
+
+#define MEMCHR_MASK_GEN(mask)                                                  \
+	do {                                                                   \
+		mask |= mask << 8;                                             \
+		mask |= mask << 16;                                            \
+		mask |= mask << 32;                                            \
+	} while (0)
+
+#endif
+
 #ifndef __HAVE_ARCH_MEMCHR
 /**
  * memchr - Find a character in an area of memory.
@@ -932,16 +955,7 @@ void *memchr_inv(const void *start, int c, size_t bytes)
 		return check_bytes8(start, value, bytes);
 
 	value64 = value;
-#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64
-	value64 *= 0x0101010101010101ULL;
-#elif defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER)
-	value64 *= 0x01010101;
-	value64 |= value64 << 32;
-#else
-	value64 |= value64 << 8;
-	value64 |= value64 << 16;
-	value64 |= value64 << 32;
-#endif
+	MEMCHR_MASK_GEN(value64);
 
 	prefix = (unsigned long)start % 8;
 	if (prefix) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 14:28 [PATCH 0/2] Optimize memchr() Yu-Jen Chang
  2022-07-10 14:28 ` [PATCH 1/2] lib/string.c: Add a macro for memchr_inv() Yu-Jen Chang
@ 2022-07-10 14:28 ` Yu-Jen Chang
  2022-07-10 15:16   ` Joe Perches
                     ` (3 more replies)
  2022-08-12 19:06 ` [PATCH 0/2] " Pavel Machek
  2 siblings, 4 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-10 14:28 UTC (permalink / raw)
  To: andy, akinobu.mita; +Cc: jserv, linux-kernel, Yu-Jen Chang

The original version of memchr() is implemented with the byte-wise
comparing technique, which does not fully use the 64-bit or 32-bit
registers of the CPU. We use word-wise comparing so that 8 characters
can be compared at the same time. This code is based on
David Laight's implementation.

We create two files to measure the performance. The first file
contains on average 10 characters ahead of the target character.
The second file contains at least 1000 characters ahead of the
target character. Our implementation of memchr() is slightly
better in the first test and nearly 4x faster than the original
implementation in the second test.
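
For reference, a self-contained user-space check of the zero-byte test
used in the loop (our illustration; the patch adds 0xfefefefefefefeff,
which in 64-bit arithmetic is the same as subtracting
0x0101010101010101):

#include <assert.h>
#include <stdint.h>

static int has_zero_byte(uint64_t v)
{
	return ((v - 0x0101010101010101ULL) & ~v &
		0x8080808080808080ULL) != 0;
}

int main(void)
{
	assert(has_zero_byte(0x11223344556600ffULL));	/* one 0x00 byte */
	assert(!has_zero_byte(0x1122334455667788ULL));	/* no 0x00 byte  */
	return 0;
}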

Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
---
 lib/string.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 80469e6c3..8ca965431 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
 #ifndef __HAVE_ARCH_MEMCHR
 /**
  * memchr - Find a character in an area of memory.
- * @s: The memory area
+ * @p: The memory area
  * @c: The byte to search for
- * @n: The size of the area.
+ * @length: The size of the area.
  *
  * returns the address of the first occurrence of @c, or %NULL
  * if @c is not found
  */
-void *memchr(const void *s, int c, size_t n)
+void *memchr(const void *p, int c, unsigned long length)
 {
-	const unsigned char *p = s;
-	while (n-- != 0) {
-        	if ((unsigned char)c == *p++) {
-			return (void *)(p - 1);
+	u64 mask, val;
+	const void *end = p + length;
+
+	c &= 0xff;
+	if (p <= end - 8) {
+		mask = c;
+		MEMCHR_MASK_GEN(mask);
+
+		for (; p <= end - 8; p += 8) {
+			val = *(u64 *)p ^ mask;
+			if ((val + 0xfefefefefefefeffu) &
+			    (~val & 0x8080808080808080u))
+				break;
 		}
 	}
+
+	for (; p < end; p++)
+		if (*(unsigned char *)p == c)
+			return (void *)p;
+
 	return NULL;
 }
 EXPORT_SYMBOL(memchr);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
@ 2022-07-10 15:16   ` Joe Perches
  2022-07-11 14:50     ` Yu-Jen Chang
  2022-07-10 16:58   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Joe Perches @ 2022-07-10 15:16 UTC (permalink / raw)
  To: Yu-Jen Chang, andy, akinobu.mita; +Cc: jserv, linux-kernel

On Sun, 2022-07-10 at 22:28 +0800, Yu-Jen Chang wrote:
> The original version of memchr() is implemented with the byte-wise
> comparing technique, which does not fully use 64-bits or 32-bits
> registers in CPU. We use word-wide comparing so that 8 characters
> can be compared at the same time on CPU. This code is base on
> David Laight's implementation.
> 
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.

It seems you did not test this with 32bit compilers as
there are 64 bit constants without ull
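
(Illustration only: with explicit suffixes the check would read

	if ((val + 0xfefefefefefefeffULL) &
	    (~val & 0x8080808080808080ULL))
		break;

so the intended 64-bit width is spelled out on 32-bit builds as well.)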

> diff --git a/lib/string.c b/lib/string.c
[]
> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
>  #ifndef __HAVE_ARCH_MEMCHR
>  /**
>   * memchr - Find a character in an area of memory.
> - * @s: The memory area
> + * @p: The memory area
>   * @c: The byte to search for
> - * @n: The size of the area.
> + * @length: The size of the area.
>   *
>   * returns the address of the first occurrence of @c, or %NULL
>   * if @c is not found
>   */
> -void *memchr(const void *s, int c, size_t n)
> +void *memchr(const void *p, int c, unsigned long length)
>  {
> -	const unsigned char *p = s;
> -	while (n-- != 0) {
> -        	if ((unsigned char)c == *p++) {
> -			return (void *)(p - 1);
> +	u64 mask, val;
> +	const void *end = p + length;
> +
> +	c &= 0xff;
> +	if (p <= end - 8) {
> +		mask = c;
> +		MEMCHR_MASK_GEN(mask);
> +
> +		for (; p <= end - 8; p += 8) {
> +			val = *(u64 *)p ^ mask;
> +			if ((val + 0xfefefefefefefeffu) &
> +			    (~val & 0x8080808080808080u))

here.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
  2022-07-10 15:16   ` Joe Perches
@ 2022-07-10 16:58   ` kernel test robot
  2022-07-10 20:01   ` Andrey Semashev
  2022-07-21  5:06   ` kernel test robot
  3 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-07-10 16:58 UTC (permalink / raw)
  To: Yu-Jen Chang, andy, akinobu.mita
  Cc: llvm, kbuild-all, jserv, linux-kernel, Yu-Jen Chang

Hi Yu-Jen,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.19-rc5 next-20220708]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yu-Jen-Chang/Optimize-memchr/20220710-223118
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b1c428b6c3684ee8ddf4137d68b3e8d51d2a700f
config: hexagon-randconfig-r041-20220710 (https://download.01.org/0day-ci/archive/20220711/202207110028.KNZ3Yupd-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 6ce63e267aab79ca87bf63453d34dd3909ab978d)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/307ee542c2bd8378b2225910f3f9b982df7a7669
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yu-Jen-Chang/Optimize-memchr/20220710-223118
        git checkout 307ee542c2bd8378b2225910f3f9b982df7a7669
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> lib/string.c:902:7: error: conflicting types for 'memchr'
   void *memchr(const void *p, int c, unsigned long length)
         ^
   include/linux/string.h:162:15: note: previous declaration is here
   extern void * memchr(const void *,int,__kernel_size_t);
                 ^
   1 error generated.


vim +/memchr +902 lib/string.c

   891	
   892	#ifndef __HAVE_ARCH_MEMCHR
   893	/**
   894	 * memchr - Find a character in an area of memory.
   895	 * @p: The memory area
   896	 * @c: The byte to search for
   897	 * @length: The size of the area.
   898	 *
   899	 * returns the address of the first occurrence of @c, or %NULL
   900	 * if @c is not found
   901	 */
 > 902	void *memchr(const void *p, int c, unsigned long length)
   903	{
   904		u64 mask, val;
   905		const void *end = p + length;
   906	
   907		c &= 0xff;
   908		if (p <= end - 8) {
   909			mask = c;
   910			MEMCHR_MASK_GEN(mask);
   911	
   912			for (; p <= end - 8; p += 8) {
   913				val = *(u64 *)p ^ mask;
   914				if ((val + 0xfefefefefefefeffu) &
   915				    (~val & 0x8080808080808080u))
   916					break;
   917			}
   918		}
   919	
   920		for (; p < end; p++)
   921			if (*(unsigned char *)p == c)
   922				return (void *)p;
   923	
   924		return NULL;
   925	}
   926	EXPORT_SYMBOL(memchr);
   927	#endif
   928	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
  2022-07-10 15:16   ` Joe Perches
  2022-07-10 16:58   ` kernel test robot
@ 2022-07-10 20:01   ` Andrey Semashev
  2022-07-11 14:52     ` Yu-Jen Chang
  2022-07-21  5:06   ` kernel test robot
  3 siblings, 1 reply; 22+ messages in thread
From: Andrey Semashev @ 2022-07-10 20:01 UTC (permalink / raw)
  To: Yu-Jen Chang, andy, akinobu.mita; +Cc: jserv, linux-kernel

On 7/10/22 17:28, Yu-Jen Chang wrote:
> The original version of memchr() is implemented with the byte-wise
> comparing technique, which does not fully use 64-bits or 32-bits
> registers in CPU. We use word-wide comparing so that 8 characters
> can be compared at the same time on CPU. This code is base on
> David Laight's implementation.
> 
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead the target character.
> The second file contains at least 1000 characters ahead the
> target character. Our implementation of “memchr()” is slightly
> better in the first test and nearly 4x faster than the orginal
> implementation in the second test.
> 
> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> ---
>  lib/string.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/string.c b/lib/string.c
> index 80469e6c3..8ca965431 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
>  #ifndef __HAVE_ARCH_MEMCHR
>  /**
>   * memchr - Find a character in an area of memory.
> - * @s: The memory area
> + * @p: The memory area
>   * @c: The byte to search for
> - * @n: The size of the area.
> + * @length: The size of the area.
>   *
>   * returns the address of the first occurrence of @c, or %NULL
>   * if @c is not found
>   */
> -void *memchr(const void *s, int c, size_t n)
> +void *memchr(const void *p, int c, unsigned long length)
>  {
> -	const unsigned char *p = s;
> -	while (n-- != 0) {
> -        	if ((unsigned char)c == *p++) {
> -			return (void *)(p - 1);
> +	u64 mask, val;
> +	const void *end = p + length;
> +
> +	c &= 0xff;
> +	if (p <= end - 8) {
> +		mask = c;
> +		MEMCHR_MASK_GEN(mask);
> +
> +		for (; p <= end - 8; p += 8) {
> +			val = *(u64 *)p ^ mask;

What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
targets support (efficient) unaligned loads, do they?

> +			if ((val + 0xfefefefefefefeffu) &
> +			    (~val & 0x8080808080808080u))
> +				break;
>  		}
>  	}
> +
> +	for (; p < end; p++)
> +		if (*(unsigned char *)p == c)
> +			return (void *)p;
> +
>  	return NULL;
>  }
>  EXPORT_SYMBOL(memchr);


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 15:16   ` Joe Perches
@ 2022-07-11 14:50     ` Yu-Jen Chang
  0 siblings, 0 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-11 14:50 UTC (permalink / raw)
  To: Joe Perches; +Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

Joe Perches <joe@perches.com> 於 2022年7月10日 週日 晚上11:16寫道:
>
> On Sun, 2022-07-10 at 22:28 +0800, Yu-Jen Chang wrote:
> > The original version of memchr() is implemented with the byte-wise
> > comparing technique, which does not fully use 64-bits or 32-bits
> > registers in CPU. We use word-wide comparing so that 8 characters
> > can be compared at the same time on CPU. This code is base on
> > David Laight's implementation.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
>
> It seems you did not test this with 32bit compilers as
> there are 64 bit constants without ull

Yeah, it is better to add ull at the end of the constant. I tested with
a 32-bit compiler and it works. The compiler will use two or more
instructions to reach the same result.

See https://godbolt.org/z/svbj18foP

>
> > diff --git a/lib/string.c b/lib/string.c
> []
> > @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> >  #ifndef __HAVE_ARCH_MEMCHR
> >  /**
> >   * memchr - Find a character in an area of memory.
> > - * @s: The memory area
> > + * @p: The memory area
> >   * @c: The byte to search for
> > - * @n: The size of the area.
> > + * @length: The size of the area.
> >   *
> >   * returns the address of the first occurrence of @c, or %NULL
> >   * if @c is not found
> >   */
> > -void *memchr(const void *s, int c, size_t n)
> > +void *memchr(const void *p, int c, unsigned long length)
> >  {
> > -     const unsigned char *p = s;
> > -     while (n-- != 0) {
> > -             if ((unsigned char)c == *p++) {
> > -                     return (void *)(p - 1);
> > +     u64 mask, val;
> > +     const void *end = p + length;
> > +
> > +     c &= 0xff;
> > +     if (p <= end - 8) {
> > +             mask = c;
> > +             MEMCHR_MASK_GEN(mask);
> > +
> > +             for (; p <= end - 8; p += 8) {
> > +                     val = *(u64 *)p ^ mask;
> > +                     if ((val + 0xfefefefefefefeffu) &
> > +                         (~val & 0x8080808080808080u))
>
> here.
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 20:01   ` Andrey Semashev
@ 2022-07-11 14:52     ` Yu-Jen Chang
  2022-07-11 15:00       ` Andrey Semashev
  0 siblings, 1 reply; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-11 14:52 UTC (permalink / raw)
  To: Andrey Semashev; +Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 凌晨4:01寫道:
>
> On 7/10/22 17:28, Yu-Jen Chang wrote:
> > The original version of memchr() is implemented with the byte-wise
> > comparing technique, which does not fully use 64-bits or 32-bits
> > registers in CPU. We use word-wide comparing so that 8 characters
> > can be compared at the same time on CPU. This code is base on
> > David Laight's implementation.
> >
> > We create two files to measure the performance. The first file
> > contains on average 10 characters ahead the target character.
> > The second file contains at least 1000 characters ahead the
> > target character. Our implementation of “memchr()” is slightly
> > better in the first test and nearly 4x faster than the orginal
> > implementation in the second test.
> >
> > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > ---
> >  lib/string.c | 28 +++++++++++++++++++++-------
> >  1 file changed, 21 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/string.c b/lib/string.c
> > index 80469e6c3..8ca965431 100644
> > --- a/lib/string.c
> > +++ b/lib/string.c
> > @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> >  #ifndef __HAVE_ARCH_MEMCHR
> >  /**
> >   * memchr - Find a character in an area of memory.
> > - * @s: The memory area
> > + * @p: The memory area
> >   * @c: The byte to search for
> > - * @n: The size of the area.
> > + * @length: The size of the area.
> >   *
> >   * returns the address of the first occurrence of @c, or %NULL
> >   * if @c is not found
> >   */
> > -void *memchr(const void *s, int c, size_t n)
> > +void *memchr(const void *p, int c, unsigned long length)
> >  {
> > -     const unsigned char *p = s;
> > -     while (n-- != 0) {
> > -             if ((unsigned char)c == *p++) {
> > -                     return (void *)(p - 1);
> > +     u64 mask, val;
> > +     const void *end = p + length;
> > +
> > +     c &= 0xff;
> > +     if (p <= end - 8) {
> > +             mask = c;
> > +             MEMCHR_MASK_GEN(mask);
> > +
> > +             for (; p <= end - 8; p += 8) {
> > +                     val = *(u64 *)p ^ mask;
>
> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> targets support (efficient) unaligned loads, do they?

I think it works if p is not aligned to 8 or 4 bytes.

Let's say the string is 10 bytes. The for loop here will search the first
8 bytes. If the target character is in the last 2 bytes, the second for
loop will find it. It also works like this on a 32-bit machine.

>
> > +                     if ((val + 0xfefefefefefefeffu) &
> > +                         (~val & 0x8080808080808080u))
> > +                             break;
> >               }
> >       }
> > +
> > +     for (; p < end; p++)
> > +             if (*(unsigned char *)p == c)
> > +                     return (void *)p;
> > +
> >       return NULL;
> >  }
> >  EXPORT_SYMBOL(memchr);
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-11 14:52     ` Yu-Jen Chang
@ 2022-07-11 15:00       ` Andrey Semashev
  2022-07-11 15:03         ` Andy Shevchenko
  2022-07-12 14:58         ` Yu-Jen Chang
  0 siblings, 2 replies; 22+ messages in thread
From: Andrey Semashev @ 2022-07-11 15:00 UTC (permalink / raw)
  To: Yu-Jen Chang; +Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

On 7/11/22 17:52, Yu-Jen Chang wrote:
> Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 凌晨4:01寫道:
>>
>> On 7/10/22 17:28, Yu-Jen Chang wrote:
>>> The original version of memchr() is implemented with the byte-wise
>>> comparing technique, which does not fully use 64-bits or 32-bits
>>> registers in CPU. We use word-wide comparing so that 8 characters
>>> can be compared at the same time on CPU. This code is base on
>>> David Laight's implementation.
>>>
>>> We create two files to measure the performance. The first file
>>> contains on average 10 characters ahead the target character.
>>> The second file contains at least 1000 characters ahead the
>>> target character. Our implementation of “memchr()” is slightly
>>> better in the first test and nearly 4x faster than the orginal
>>> implementation in the second test.
>>>
>>> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
>>> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
>>> ---
>>>  lib/string.c | 28 +++++++++++++++++++++-------
>>>  1 file changed, 21 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/lib/string.c b/lib/string.c
>>> index 80469e6c3..8ca965431 100644
>>> --- a/lib/string.c
>>> +++ b/lib/string.c
>>> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
>>>  #ifndef __HAVE_ARCH_MEMCHR
>>>  /**
>>>   * memchr - Find a character in an area of memory.
>>> - * @s: The memory area
>>> + * @p: The memory area
>>>   * @c: The byte to search for
>>> - * @n: The size of the area.
>>> + * @length: The size of the area.
>>>   *
>>>   * returns the address of the first occurrence of @c, or %NULL
>>>   * if @c is not found
>>>   */
>>> -void *memchr(const void *s, int c, size_t n)
>>> +void *memchr(const void *p, int c, unsigned long length)

I didn't comment on this initially, but is the change of length type
intentional? Why?

>>>  {
>>> -     const unsigned char *p = s;
>>> -     while (n-- != 0) {
>>> -             if ((unsigned char)c == *p++) {
>>> -                     return (void *)(p - 1);
>>> +     u64 mask, val;
>>> +     const void *end = p + length;
>>> +
>>> +     c &= 0xff;
>>> +     if (p <= end - 8) {
>>> +             mask = c;
>>> +             MEMCHR_MASK_GEN(mask);
>>> +
>>> +             for (; p <= end - 8; p += 8) {
>>> +                     val = *(u64 *)p ^ mask;
>>
>> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
>> targets support (efficient) unaligned loads, do they?
> 
> I think it works if p is not aligned to 8 or 4 bytes.
> 
> Let's say the string is 10 bytes. The for loop here will search the first
> 8 bytes. If the target character is in the last 2 bytes, the second for
> loop will find it. It also work like this on 32-bit machine.

I think you're missing the point. Loads at unaligned addresses may not
be allowed by hardware using conventional load instructions or may be
inefficient. Given that this memchr implementation is used as a fallback
when no hardware-specific version is available, you should be
conservative wrt. hardware capabilities and behavior. You should
probably have a pre-alignment loop.
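
For illustration, a minimal sketch of such a pre-alignment loop (using
the names from the patch and assuming c is already masked to a single
byte):

	while (((unsigned long)p & (sizeof(unsigned long) - 1)) && p < end) {
		if (*(const unsigned char *)p == c)
			return (void *)p;
		p++;
	}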

>>
>>> +                     if ((val + 0xfefefefefefefeffu) &
>>> +                         (~val & 0x8080808080808080u))
>>> +                             break;
>>>               }
>>>       }
>>> +
>>> +     for (; p < end; p++)
>>> +             if (*(unsigned char *)p == c)
>>> +                     return (void *)p;
>>> +
>>>       return NULL;
>>>  }
>>>  EXPORT_SYMBOL(memchr);
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-11 15:00       ` Andrey Semashev
@ 2022-07-11 15:03         ` Andy Shevchenko
  2022-07-12 14:58         ` Yu-Jen Chang
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Shevchenko @ 2022-07-11 15:03 UTC (permalink / raw)
  To: Andrey Semashev
  Cc: Yu-Jen Chang, Andy Shevchenko, Akinobu Mita, Ching-Chun Huang,
	Linux Kernel Mailing List

On Mon, Jul 11, 2022 at 5:00 PM Andrey Semashev
<andrey.semashev@gmail.com> wrote:
> On 7/11/22 17:52, Yu-Jen Chang wrote:

...

> I think you're missing the point. Loads at unaligned addresses may not
> be allowed by hardware using conventional load instructions or may be
> inefficient. Given that this memchr implementation is used as a fallback
> when no hardware-specific version is available, you should be
> conservative wrt. hardware capabilities and behavior. You should
> probably have a pre-alignment loop.

Exactly!
The initial code is broken, NAK.

P.S. At least you may look into strscpy() implementation to get a clue.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-11 15:00       ` Andrey Semashev
  2022-07-11 15:03         ` Andy Shevchenko
@ 2022-07-12 14:58         ` Yu-Jen Chang
  2022-07-12 15:08           ` Andy Shevchenko
  2022-07-13  9:39           ` David Laight
  1 sibling, 2 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-12 14:58 UTC (permalink / raw)
  To: Andrey Semashev; +Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 晚上11:00寫道:
>
> On 7/11/22 17:52, Yu-Jen Chang wrote:
> > Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 凌晨4:01寫道:
> >>
> >> On 7/10/22 17:28, Yu-Jen Chang wrote:
> >>> The original version of memchr() is implemented with the byte-wise
> >>> comparing technique, which does not fully use 64-bits or 32-bits
> >>> registers in CPU. We use word-wide comparing so that 8 characters
> >>> can be compared at the same time on CPU. This code is base on
> >>> David Laight's implementation.
> >>>
> >>> We create two files to measure the performance. The first file
> >>> contains on average 10 characters ahead the target character.
> >>> The second file contains at least 1000 characters ahead the
> >>> target character. Our implementation of “memchr()” is slightly
> >>> better in the first test and nearly 4x faster than the orginal
> >>> implementation in the second test.
> >>>
> >>> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> >>> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> >>> ---
> >>>  lib/string.c | 28 +++++++++++++++++++++-------
> >>>  1 file changed, 21 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/lib/string.c b/lib/string.c
> >>> index 80469e6c3..8ca965431 100644
> >>> --- a/lib/string.c
> >>> +++ b/lib/string.c
> >>> @@ -905,21 +905,35 @@ EXPORT_SYMBOL(strnstr);
> >>>  #ifndef __HAVE_ARCH_MEMCHR
> >>>  /**
> >>>   * memchr - Find a character in an area of memory.
> >>> - * @s: The memory area
> >>> + * @p: The memory area
> >>>   * @c: The byte to search for
> >>> - * @n: The size of the area.
> >>> + * @length: The size of the area.
> >>>   *
> >>>   * returns the address of the first occurrence of @c, or %NULL
> >>>   * if @c is not found
> >>>   */
> >>> -void *memchr(const void *s, int c, size_t n)
> >>> +void *memchr(const void *p, int c, unsigned long length)
>
> I didn't comment on this initially, but is the change of length type
> intentional? Why?

It is my mistake. I will change the type back to size_t.

>
> >>>  {
> >>> -     const unsigned char *p = s;
> >>> -     while (n-- != 0) {
> >>> -             if ((unsigned char)c == *p++) {
> >>> -                     return (void *)(p - 1);
> >>> +     u64 mask, val;
> >>> +     const void *end = p + length;
> >>> +
> >>> +     c &= 0xff;
> >>> +     if (p <= end - 8) {
> >>> +             mask = c;
> >>> +             MEMCHR_MASK_GEN(mask);
> >>> +
> >>> +             for (; p <= end - 8; p += 8) {
> >>> +                     val = *(u64 *)p ^ mask;
> >>
> >> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> >> targets support (efficient) unaligned loads, do they?
> >
> > I think it works if p is not aligned to 8 or 4 bytes.
> >
> > Let's say the string is 10 bytes. The for loop here will search the first
> > 8 bytes. If the target character is in the last 2 bytes, the second for
> > loop will find it. It also work like this on 32-bit machine.
>
> I think you're missing the point. Loads at unaligned addresses may not
> be allowed by hardware using conventional load instructions or may be
> inefficient. Given that this memchr implementation is used as a fallback
> when no hardware-specific version is available, you should be
> conservative wrt. hardware capabilities and behavior. You should
> probably have a pre-alignment loop.

Got it. I added a pre-alignment loop. It aligns the address to 8 or 4 bytes.

void *memchr(const void *p, int c, size_t length)
{
	u64 mask, val;
	const void *end = p + length;

	c &= 0xff;
	while ((long)p & (sizeof(long) - 1)) {
		if (p >= end)
			return NULL;
		if (*(unsigned char *)p == c)
			return (void *)p;
		p++;
	}

	if (p <= end - 8) {
		mask = c;
		MEMCHR_MASK_GEN(mask);

		for (; p <= end - 8; p += 8) {
			val = *(u64 *)p ^ mask;
			if ((val + 0xfefefefefefefeffull) &
			    (~val & 0x8080808080808080ull))
				break;
		}
	}

	for (; p < end; p++)
		if (*(unsigned char *)p == c)
			return (void *)p;

	return NULL;
}

>
> >>
> >>> +                     if ((val + 0xfefefefefefefeffu) &
> >>> +                         (~val & 0x8080808080808080u))
> >>> +                             break;
> >>>               }
> >>>       }
> >>> +
> >>> +     for (; p < end; p++)
> >>> +             if (*(unsigned char *)p == c)
> >>> +                     return (void *)p;
> >>> +
> >>>       return NULL;
> >>>  }
> >>>  EXPORT_SYMBOL(memchr);
> >>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-12 14:58         ` Yu-Jen Chang
@ 2022-07-12 15:08           ` Andy Shevchenko
  2022-07-13  9:39           ` David Laight
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Shevchenko @ 2022-07-12 15:08 UTC (permalink / raw)
  To: Yu-Jen Chang
  Cc: Andrey Semashev, Andy Shevchenko, Akinobu Mita, Ching-Chun Huang,
	Linux Kernel Mailing List

On Tue, Jul 12, 2022 at 4:58 PM Yu-Jen Chang <arthurchang09@gmail.com> wrote:
> Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 晚上11:00寫道:
> > On 7/11/22 17:52, Yu-Jen Chang wrote:
> > > Andrey Semashev <andrey.semashev@gmail.com> 於 2022年7月11日 週一 凌晨4:01寫道:
> > >> On 7/10/22 17:28, Yu-Jen Chang wrote:

...

> > >>> +             for (; p <= end - 8; p += 8) {
> > >>> +                     val = *(u64 *)p ^ mask;
> > >>
> > >> What if p is not aligned to 8 (or 4 on 32-bit targets) bytes? Not all
> > >> targets support (efficient) unaligned loads, do they?
> > >
> > > I think it works if p is not aligned to 8 or 4 bytes.
> > >
> > > Let's say the string is 10 bytes. The for loop here will search the first
> > > 8 bytes. If the target character is in the last 2 bytes, the second for
> > > loop will find it. It also work like this on 32-bit machine.
> >
> > I think you're missing the point. Loads at unaligned addresses may not
> > be allowed by hardware using conventional load instructions or may be
> > inefficient. Given that this memchr implementation is used as a fallback
> > when no hardware-specific version is available, you should be
> > conservative wrt. hardware capabilities and behavior. You should
> > probably have a pre-alignment loop.
>
> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.

Still far from what can be accepted. Have you had a chance to read how
strscpy() is implemented? Do you understand why it's done that way?

> void *memchr(const void *p, int c, size_t length)
> {
>     u64 mask, val;
>     const void *end = p + length;
>     c &= 0xff;

>     while ((long ) p & (sizeof(long) - 1)) {
>         if (p >= end)
>             return NULL;
>         if (*(unsigned char *)p == c)
>             return (void *) p;
>         p++;
>     }
>     if (p <= end - 8) {
>         mask = c;
>         MEMCHR_MASK_GEN(mask);
>
>         for (; p <= end - 8; p += 8) {

Why did you decide that this code will be run explicitly on a 64-bit arch?

>             val = *(u64*)p ^ mask;
>             if ((val + 0xfefefefefefefeffull)
> & (~val & 0x8080808080808080ull))
>                 break;
>         }
>     }
>
>     for (; p < end; p++)
>         if (*(unsigned char *)p == c)
>             return (void *)p;
>
>     return NULL;
> }

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-12 14:58         ` Yu-Jen Chang
  2022-07-12 15:08           ` Andy Shevchenko
@ 2022-07-13  9:39           ` David Laight
  2022-07-13  9:49             ` Andrey Semashev
  2022-07-13  9:57             ` Andrey Semashev
  1 sibling, 2 replies; 22+ messages in thread
From: David Laight @ 2022-07-13  9:39 UTC (permalink / raw)
  To: 'Yu-Jen Chang', Andrey Semashev
  Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

From: Yu-Jen Chang
> Sent: 12 July 2022 15:59
...
> > I think you're missing the point. Loads at unaligned addresses may not
> > be allowed by hardware using conventional load instructions or may be
> > inefficient. Given that this memchr implementation is used as a fallback
> > when no hardware-specific version is available, you should be
> > conservative wrt. hardware capabilities and behavior. You should
> > probably have a pre-alignment loop.
> 
> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.

That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.

...
>         for (; p <= end - 8; p += 8) {
>             val = *(u64*)p ^ mask;
>             if ((val + 0xfefefefefefefeffull)
> & (~val & 0x8080808080808080ull))
>                 break;

I would add a couple of comments, like:
	// Convert to check for zero byte.
	// Standard check for a zero byte in a word.
(But not the big 4 line explanation you had.)

It is also worth looking at how that code compiles
on a 32-bit arch that doesn't have a carry flag.
That is everything based on MIPS, including riscv.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-13  9:39           ` David Laight
@ 2022-07-13  9:49             ` Andrey Semashev
  2022-07-13 10:02               ` Andrey Semashev
  2022-07-13  9:57             ` Andrey Semashev
  1 sibling, 1 reply; 22+ messages in thread
From: Andrey Semashev @ 2022-07-13  9:49 UTC (permalink / raw)
  To: David Laight, 'Yu-Jen Chang'
  Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

On 7/13/22 12:39, David Laight wrote:
> From: Yu-Jen Chang
>> Sent: 12 July 2022 15:59
> ...
>>> I think you're missing the point. Loads at unaligned addresses may not
>>> be allowed by hardware using conventional load instructions or may be
>>> inefficient. Given that this memchr implementation is used as a fallback
>>> when no hardware-specific version is available, you should be
>>> conservative wrt. hardware capabilities and behavior. You should
>>> probably have a pre-alignment loop.
>>
>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
> 
> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> 
> ...
>>         for (; p <= end - 8; p += 8) {
>>             val = *(u64*)p ^ mask;
>>             if ((val + 0xfefefefefefefeffull)
>> & (~val & 0x8080808080808080ull))
>>                 break;
> 
> I would add a couple of comments, like:
> 	// Convert to check for zero byte.
> 	// Standard check for a zero byte in a word.
> (But not the big 4 line explanation you had.
> 
> It is also worth looking at how that code compiles
> on 32bit arch that don't have a carry flag.
> That is everything based on MIPS, including riscv.

It may be worth looking at how glibc does it:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329fc302;hb=refs/heads/release/2.35/master#l46

They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
think memchr in the kernel should follow this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-13  9:39           ` David Laight
  2022-07-13  9:49             ` Andrey Semashev
@ 2022-07-13  9:57             ` Andrey Semashev
  1 sibling, 0 replies; 22+ messages in thread
From: Andrey Semashev @ 2022-07-13  9:57 UTC (permalink / raw)
  To: David Laight, 'Yu-Jen Chang'
  Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

On 7/13/22 12:39, David Laight wrote:
> From: Yu-Jen Chang
>> Sent: 12 July 2022 15:59
> ...
>>> I think you're missing the point. Loads at unaligned addresses may not
>>> be allowed by hardware using conventional load instructions or may be
>>> inefficient. Given that this memchr implementation is used as a fallback
>>> when no hardware-specific version is available, you should be
>>> conservative wrt. hardware capabilities and behavior. You should
>>> probably have a pre-alignment loop.
>>
>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
> 
> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.

If there is a pre-alignment loop, there won't be unaligned loads, so
there shouldn't be the need for the HAS_EFFICIENT_UNALIGNED_ACCESS
check. Unless I misunderstand what HAS_EFFICIENT_UNALIGNED_ACCESS indicates.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-13  9:49             ` Andrey Semashev
@ 2022-07-13 10:02               ` Andrey Semashev
  2022-07-13 10:24                 ` David Laight
  0 siblings, 1 reply; 22+ messages in thread
From: Andrey Semashev @ 2022-07-13 10:02 UTC (permalink / raw)
  To: David Laight, 'Yu-Jen Chang'
  Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

On 7/13/22 12:49, Andrey Semashev wrote:
> On 7/13/22 12:39, David Laight wrote:
>> From: Yu-Jen Chang
>>> Sent: 12 July 2022 15:59
>> ...
>>>> I think you're missing the point. Loads at unaligned addresses may not
>>>> be allowed by hardware using conventional load instructions or may be
>>>> inefficient. Given that this memchr implementation is used as a fallback
>>>> when no hardware-specific version is available, you should be
>>>> conservative wrt. hardware capabilities and behavior. You should
>>>> probably have a pre-alignment loop.
>>>
>>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
>>
>> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
>>
>> ...
>>>         for (; p <= end - 8; p += 8) {
>>>             val = *(u64*)p ^ mask;
>>>             if ((val + 0xfefefefefefefeffull)
>>> & (~val & 0x8080808080808080ull))
>>>                 break;
>>
>> I would add a couple of comments, like:
>> 	// Convert to check for zero byte.
>> 	// Standard check for a zero byte in a word.
>> (But not the big 4 line explanation you had.
>>
>> It is also worth looking at how that code compiles
>> on 32bit arch that don't have a carry flag.
>> That is everything based on MIPS, including riscv.
> 
> It may be worth looking at how glibc does it:
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329fc302;hb=refs/heads/release/2.35/master#l46
> 
> They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> think memchr in the kernel should follow this.

Also, if by chance this optimization is aimed for x86-64, it may be
worth adding an arch-specific version that uses ERMS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-13 10:02               ` Andrey Semashev
@ 2022-07-13 10:24                 ` David Laight
  2022-07-22 16:08                   ` Yu-Jen Chang
  0 siblings, 1 reply; 22+ messages in thread
From: David Laight @ 2022-07-13 10:24 UTC (permalink / raw)
  To: 'Andrey Semashev', 'Yu-Jen Chang'
  Cc: andy, akinobu.mita, Ching-Chun Huang, linux-kernel

From: Andrey Semashev
> Sent: 13 July 2022 11:03
> 
> On 7/13/22 12:49, Andrey Semashev wrote:
> > On 7/13/22 12:39, David Laight wrote:
> >> From: Yu-Jen Chang
> >>> Sent: 12 July 2022 15:59
> >> ...
> >>>> I think you're missing the point. Loads at unaligned addresses may not
> >>>> be allowed by hardware using conventional load instructions or may be
> >>>> inefficient. Given that this memchr implementation is used as a fallback
> >>>> when no hardware-specific version is available, you should be
> >>>> conservative wrt. hardware capabilities and behavior. You should
> >>>> probably have a pre-alignment loop.
> >>>
> >>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
> >>
> >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> >>
> >> ...
> >>>         for (; p <= end - 8; p += 8) {
> >>>             val = *(u64*)p ^ mask;
> >>>             if ((val + 0xfefefefefefefeffull)
> >>> & (~val & 0x8080808080808080ull))
> >>>                 break;
> >>
> >> I would add a couple of comments, like:
> >> 	// Convert to check for zero byte.
> >> 	// Standard check for a zero byte in a word.
> >> (But not the big 4 line explanation you had.
> >>
> >> It is also worth looking at how that code compiles
> >> on 32bit arch that don't have a carry flag.
> >> That is everything based on MIPS, including riscv.
> >
> > It may be worth looking at how glibc does it:
> >
> >
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
> fc302;hb=refs/heads/release/2.35/master#l46
> >
> > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> > think memchr in the kernel should follow this.
> 
> Also, if by chance this optimization is aimed for x86-64, it may be
> worth adding an arch-specific version that uses ERMS.

Don't believe everything you see in glibc.
The common cases in the kernel are different from the ones they
tend to optimise for.

You might try using:
	#define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
for the mask and the two constants.
Then make all the variables 'long'.

I'm not at all sure what the test for fast multiply is about.
It may be very historic; for a modern cpu, generating the 64-bit
constant is likely to be more problematic (check arm64).
If the input value is 'unsigned char' (or masked) then the
compiler may decide to do the repeated <<= itself.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
                     ` (2 preceding siblings ...)
  2022-07-10 20:01   ` Andrey Semashev
@ 2022-07-21  5:06   ` kernel test robot
  3 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-07-21  5:06 UTC (permalink / raw)
  To: Yu-Jen Chang, andy, akinobu.mita
  Cc: kbuild-all, jserv, linux-kernel, Yu-Jen Chang

Hi Yu-Jen,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.19-rc7 next-20220720]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yu-Jen-Chang/Optimize-memchr/20220710-223118
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b1c428b6c3684ee8ddf4137d68b3e8d51d2a700f
config: arc-buildonly-randconfig-r005-20220719 (https://download.01.org/0day-ci/archive/20220721/202207211216.wsqhuMcj-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/307ee542c2bd8378b2225910f3f9b982df7a7669
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yu-Jen-Chang/Optimize-memchr/20220710-223118
        git checkout 307ee542c2bd8378b2225910f3f9b982df7a7669
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> lib/string.c:902:7: error: conflicting types for 'memchr'; have 'void *(const void *, int,  long unsigned int)'
     902 | void *memchr(const void *p, int c, unsigned long length)
         |       ^~~~~~
   In file included from lib/string.c:19:
   include/linux/string.h:162:15: note: previous declaration of 'memchr' with type 'void *(const void *, int,  __kernel_size_t)' {aka 'void *(const void *, int,  unsigned int)'}
     162 | extern void * memchr(const void *,int,__kernel_size_t);
         |               ^~~~~~


vim +902 lib/string.c

   891	
   892	#ifndef __HAVE_ARCH_MEMCHR
   893	/**
   894	 * memchr - Find a character in an area of memory.
   895	 * @p: The memory area
   896	 * @c: The byte to search for
   897	 * @length: The size of the area.
   898	 *
   899	 * returns the address of the first occurrence of @c, or %NULL
   900	 * if @c is not found
   901	 */
 > 902	void *memchr(const void *p, int c, unsigned long length)
   903	{
   904		u64 mask, val;
   905		const void *end = p + length;
   906	
   907		c &= 0xff;
   908		if (p <= end - 8) {
   909			mask = c;
   910			MEMCHR_MASK_GEN(mask);
   911	
   912			for (; p <= end - 8; p += 8) {
   913				val = *(u64 *)p ^ mask;
   914				if ((val + 0xfefefefefefefeffu) &
   915				    (~val & 0x8080808080808080u))
   916					break;
   917			}
   918		}
   919	
   920		for (; p < end; p++)
   921			if (*(unsigned char *)p == c)
   922				return (void *)p;
   923	
   924		return NULL;
   925	}
   926	EXPORT_SYMBOL(memchr);
   927	#endif
   928	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
  2022-07-13 10:24                 ` David Laight
@ 2022-07-22 16:08                   ` Yu-Jen Chang
       [not found]                     ` <CAHp75Vfy6wYqzT-T9aEjVEAQCZ_k=0qN8S8OwG3knbrC-oOkMQ@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-22 16:08 UTC (permalink / raw)
  To: David Laight, andy, Andrey Semashev
  Cc: akinobu.mita, Ching-Chun Huang, linux-kernel

On Wed, Jul 13, 2022 at 6:24 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Andrey Semashev
> > Sent: 13 July 2022 11:03
> >
> > On 7/13/22 12:49, Andrey Semashev wrote:
> > > On 7/13/22 12:39, David Laight wrote:
> > >> From: Yu-Jen Chang
> > >>> Sent: 12 July 2022 15:59
> > >> ...
> > >>>> I think you're missing the point. Loads at unaligned addresses may not
> > >>>> be allowed by hardware using conventional load instructions or may be
> > >>>> inefficient. Given that this memchr implementation is used as a fallback
> > >>>> when no hardware-specific version is available, you should be
> > >>>> conservative wrt. hardware capabilities and behavior. You should
> > >>>> probably have a pre-alignment loop.
> > >>>
> > >>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
> > >>
> > >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> > >>
> > >> ...
> > >>>         for (; p <= end - 8; p += 8) {
> > >>>             val = *(u64*)p ^ mask;
> > >>>             if ((val + 0xfefefefefefefeffull)
> > >>> & (~val & 0x8080808080808080ull))
> > >>>                 break;
> > >>
> > >> I would add a couple of comments, like:
> > >>    // Convert to check for zero byte.
> > >>    // Standard check for a zero byte in a word.
> > >> (But not the big 4 line explanation you had.
> > >>
> > >> It is also worth looking at how that code compiles
> > >> on 32bit arch that don't have a carry flag.
> > >> That is everything based on MIPS, including riscv.
> > >
> > > It may be worth looking at how glibc does it:
> > >
> > >
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
> > fc302;hb=refs/heads/release/2.35/master#l46
> > >
> > > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> > > think memchr in the kernel should follow this.
> >
> > Also, if by chance this optimization is aimed for x86-64, it may be
> > worth adding an arch-specific version that uses ERMS.
>
> Don't believe everything you see in glibc.
> The common cases in the kernel are different from the ones they
> tend to optimise for..
>
> You might try using:
>         #define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
> for the mask and the two constants.
> Then make all the variables 'long'.
>
> I'm not at all sure what the test for fast multiply is about.
> It may be very historic, for modern cpu generating the 64bit
> constant is likely to be more problematic (check arm64).
> If the input value is 'unsigned char' (or masked) then the
> compiler may decide to do the repeated <<= itself.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


I have rewritten my code as follows. I use several macros to
generate the mask and detect the target char for 32-bit or 64-bit.
The DETECT_CHAR macro is the same as this part:

(val + 0xfefefefefefefeffull) & (~val & 0x8080808080808080ull)

0xfefefefefefefeffull is the two's complement of 0x0101010101010101ULL,
so the addition performs exactly that subtraction.
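
A worked check of that identity (our illustration, plain C11):

	/* 0x0101010101010101 + 0xfefefefefefefeff wraps to zero in 64 bits,
	 * so adding one constant is the same as subtracting the other. */
	_Static_assert(0x0101010101010101ULL + 0xfefefefefefefeffULL == 0,
		       "the two constants are 64-bit additive inverses");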

If the CPU architecture is unable to perform unaligned access
efficiently, the pre-alignment loop will align the pointer first.


#define LBLOCKSIZE (sizeof(long))

#if BITS_PER_LONG == 64

#define DETECT_CHAR(X) \
	(((X) - 0x0101010101010101ULL) & ~(X) & 0x8080808080808080ULL)

#define MEMCHR_MASK_GEN(mask)           \
	do {                            \
		(mask) |= (mask) << 8;  \
		(mask) |= (mask) << 16; \
		(mask) |= (mask) << 32; \
	} while (0)

#else

#define DETECT_CHAR(X) \
	(((X) - 0x01010101UL) & ~(X) & 0x80808080UL)

#define MEMCHR_MASK_GEN(mask)           \
	do {                            \
		(mask) |= (mask) << 8;  \
		(mask) |= (mask) << 16; \
	} while (0)

#endif

void *memchr(const void *p, int c, size_t length)
{
	unsigned long mask, val;
	const void *end = p + length;

	c &= 0xff;
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	while ((long)p & (sizeof(long) - 1)) {
		if (p >= end)
			return NULL;
		if (*(unsigned char *)p == c)
			return (void *)p;
		p++;
	}
#endif
	if (p <= end - LBLOCKSIZE) {
		mask = c;
		MEMCHR_MASK_GEN(mask);

		for (; p <= end - LBLOCKSIZE; p += LBLOCKSIZE) {
			val = *(unsigned long *)p ^ mask;
			if (DETECT_CHAR(val))
				break;
		}
	}

	for (; p < end; p++)
		if (*(unsigned char *)p == c)
			return (void *)p;

	return NULL;
}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib/string.c: Optimize memchr()
       [not found]                     ` <CAHp75Vfy6wYqzT-T9aEjVEAQCZ_k=0qN8S8OwG3knbrC-oOkMQ@mail.gmail.com>
@ 2022-07-29 15:42                       ` Yu-Jen Chang
  0 siblings, 0 replies; 22+ messages in thread
From: Yu-Jen Chang @ 2022-07-29 15:42 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: David Laight, andy, Andrey Semashev, akinobu.mita,
	Ching-Chun Huang, linux-kernel

On Sat, Jul 23, 2022 at 12:56 AM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
>
>
>
> On Friday, July 22, 2022, Yu-Jen Chang <arthurchang09@gmail.com> wrote:
>>
>> On Wed, Jul 13, 2022 at 6:24 PM David Laight <David.Laight@aculab.com> wrote:
>> >
>> > From: Andrey Semashev
>> > > Sent: 13 July 2022 11:03
>> > >
>> > > On 7/13/22 12:49, Andrey Semashev wrote:
>> > > > On 7/13/22 12:39, David Laight wrote:
>> > > >> From: Yu-Jen Chang
>> > > >>> Sent: 12 July 2022 15:59
>> > > >> ...
>> > > >>>> I think you're missing the point. Loads at unaligned addresses may not
>> > > >>>> be allowed by hardware using conventional load instructions or may be
>> > > >>>> inefficient. Given that this memchr implementation is used as a fallback
>> > > >>>> when no hardware-specific version is available, you should be
>> > > >>>> conservative wrt. hardware capabilities and behavior. You should
>> > > >>>> probably have a pre-alignment loop.
>> > > >>>
>> > > >>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
>> > > >>
>> > > >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
>> > > >>
>> > > >> ...
>> > > >>>         for (; p <= end - 8; p += 8) {
>> > > >>>             val = *(u64*)p ^ mask;
>> > > >>>             if ((val + 0xfefefefefefefeffull)
>> > > >>> & (~val & 0x8080808080808080ull))
>> > > >>>                 break;
>> > > >>
>> > > >> I would add a couple of comments, like:
>> > > >>    // Convert to check for zero byte.
>> > > >>    // Standard check for a zero byte in a word.
>> > > >> (But not the big 4 line explanation you had.
>> > > >>
>> > > >> It is also worth looking at how that code compiles
>> > > >> on 32bit arch that don't have a carry flag.
>> > > >> That is everything based on MIPS, including riscv.
>> > > >
>> > > > It may be worth looking at how glibc does it:
>> > > >
>> > > >
>> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
>> > > fc302;hb=refs/heads/release/2.35/master#l46
>> > > >
>> > > > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
>> > > > think memchr in the kernel should follow this.
>> > >
>> > > Also, if by chance this optimization is aimed for x86-64, it may be
>> > > worth adding an arch-specific version that uses ERMS.
>> >
>> > Don't believe everything you see in glibc.
>> > The common cases in the kernel are different from the ones they
>> > tend to optimise for..
>> >
>> > You might try using:
>> >         #define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
>> > for the mask and the two constants.
>> > Then make all the variables 'long'.
>> >
>> > I'm not at all sure what the test for fast multiply is about.
>> > It may be very historic, for modern cpu generating the 64bit
>> > constant is likely to be more problematic (check arm64).
>> > If the input value is 'unsigned char' (or masked) then the
>> > compiler may decide to do the repeated <<= itself.
>> >
>> >         David
>> >
>> > -
>> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> > Registration No: 1397386 (Wales)
>>
>>
>> I have rewritten my code as follows. I use several macros
>
>
>
>  My gosh, have you checked what strscpy() does under the hood?
>
> Why can’t you just reuse parts of it?
>
>>
>> to generate the mask and detect the target character for 32-bit or 64-bit.
>> The DETECT_CHAR() macro is equivalent to this expression:
>>
>> (val + 0xfefefefefefefeffull) & (~val & 0x8080808080808080ull)
>>
>> The constant 0xfefefefefefefeffull is the two's complement of 0x0101010101010101ULL,
>> so adding it performs exactly that subtraction.
>>
>> If the CPU architecture cannot perform unaligned accesses
>> efficiently, the pre-alignment loop aligns the pointer first.
>>
>>
>> #define LBLOCKSIZE (sizeof(long))
>> #if BITS_PER_LONG == 64
>>
>> #define DETECT_CHAR(X)                                                         \
>> (((X) - 0x0101010101010101ULL) & ~(X) & 0x8080808080808080ULL)
>>
>> #define MEMCHR_MASK_GEN(mask)                                                  \
>> do {                                                                   \
>>     (mask) |= (mask) << 8;                                         \
>>     (mask) |= (mask) << 16;                                        \
>>     (mask) |= (mask) << 32;                                        \
>> } while (0)
>>
>> #else
>>
>> #define DETECT_CHAR(X) (((X) - 0x01010101UL) & ~(X) & 0x80808080UL)
>>
>> #define MEMCHR_MASK_GEN(mask)                                                  \
>> do {                                                                   \
>>     (mask) |= (mask) << 8;                                         \
>>     (mask) |= (mask) << 16;                                        \
>> } while (0)
>>
>> #endif
>>
>> void *memchr(const void *p, int c, size_t length)
>> {
>>     unsigned long mask, val;
>>     const void *end = p + length;
>>     c &= 0xff;
>> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>     while ((long)p & (sizeof(long) - 1)) {
>>         if (p >= end)
>>             return NULL;
>>         if (*(unsigned char *)p == c)
>>             return (void *)p;
>>         p++;
>>     }
>> #endif
>>     if (p <= end - LBLOCKSIZE) {
>>         mask = c;
>>         MEMCHR_MASK_GEN(mask);
>>
>>        for (; p <= end - LBLOCKSIZE; p += LBLOCKSIZE) {
>>               val = *(unsigned long *)p ^ mask;
>>               if (DETECT_CHAR(val))
>>                   break;
>>        }
>>     }
>>
>>     for (; p < end; p++)
>>        if (*(unsigned char *)p == c)
>>            return (void *)p;
>>
>>     return NULL;
>> }
>
>
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

I reuse some parts of strscpy(): I use read_word_at_a_time() to read the
string one word at a time and has_zero() to check whether the target
character is in the current word.

Yu-Jen Chang


#if BITS_PER_LONG == 64

#define MEMCHR_MASK_GEN(mask)                                                  \
    do {                                                                   \
        (mask) |= (mask) << 8;                                         \
        (mask) |= (mask) << 16;                                        \
        (mask) |= (mask) << 32;                                        \
    } while (0)

#else

#define MEMCHR_MASK_GEN(mask)                                                  \
    do {                                                                   \
        (mask) |= (mask) << 8;                                         \
        (mask) |= (mask) << 16;                                        \
    } while (0)

#endif

void *memchr(const void *p, int c, size_t length)
{
    unsigned long mask;
    const unsigned char *src = p;
    const unsigned char *end = src + length;
    const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
    size_t max = length;

    c &= 0xff;

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
    /*
     * If src is unaligned, don't cross a page boundary,
     * since we don't know if the next page is mapped.
     */
    if ((long)src & (sizeof(long) - 1)) {
        size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));

        if (limit < max)
            max = limit;
    }
#else
    /* If src is unaligned, don't do word-at-a-time at all. */
    if ((long)src & (sizeof(long) - 1))
        max = 0;
#endif

    mask = c;
    MEMCHR_MASK_GEN(mask);

    for (; max >= sizeof(unsigned long);
         max -= sizeof(unsigned long), src += sizeof(unsigned long)) {
        unsigned long data, result;

        data = read_word_at_a_time(src) ^ mask;
        if (has_zero(data, &result, &constants))
            break;
    }

    /* Byte-wise scan of the tail and of the word that matched. */
    for (; src < end; src++)
        if (*src == c)
            return (void *)src;

    return NULL;
}
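
For completeness, a rough sketch of one way such a change could be
sanity-checked (the helper names below are made up, not existing kernel
API): compare the word-at-a-time memchr() with a plain byte-wise scan over
every offset and length in a small buffer, so both the aligned and
unaligned paths get exercised.

/* Byte-wise reference, used only for the comparison. */
static void *memchr_byte(const void *p, int c, size_t n)
{
    const unsigned char *s = p;

    for (; n; n--, s++)
        if (*s == (unsigned char)c)
            return (void *)s;
    return NULL;
}

static int memchr_selftest(void)
{
    unsigned char buf[64];
    size_t off, len;

    for (off = 0; off < sizeof(buf); off++)
        buf[off] = off;             /* byte 0x20 ends up at index 32 */

    for (off = 0; off < sizeof(buf); off++)
        for (len = 0; len <= sizeof(buf) - off; len++)
            if (memchr(buf + off, 0x20, len) !=
                memchr_byte(buf + off, 0x20, len))
                return -1;          /* mismatch */

    return 0;
}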

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] Optimize memchr()
  2022-07-10 14:28 [PATCH 0/2] Optimize memchr() Yu-Jen Chang
  2022-07-10 14:28 ` [PATCH 1/2] lib/string.c: Add a macro for memchr_inv() Yu-Jen Chang
  2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
@ 2022-08-12 19:06 ` Pavel Machek
  2022-08-15 10:59   ` David Laight
  2 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2022-08-12 19:06 UTC (permalink / raw)
  To: Yu-Jen Chang; +Cc: andy, akinobu.mita, jserv, linux-kernel

Hi!

> This patche series optimized "memchr()" and add a macro for 
> "memchr_inv()" so that both funtions can use it to generate bit mask.
> 
> The original implementaion of "memchr()" is based on byte-wise comparison,
> which do not fully use 64-bit or 32-bit register in CPU. We implement a
> word-wise comparison so that at least 4 bytes can be compared at the same
> time. The optimized "memchr()" is nearly 4x faster than the original one
> for long strings. In Linux Kernel, we find that the length of the string

Well... how much slower is it for short strings?

> searched by "memchr()" is up to 512 bytes in drivers/misc/lkdtm/heap.c.
> In our test, the optimized version is about 20% faster if the target
> character is at the end of the string when going through a 512-byte
> string.

"What is the average length passed to memchr" would be more useful question.

Best regards,
								Pavel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 0/2] Optimize memchr()
  2022-08-12 19:06 ` [PATCH 0/2] " Pavel Machek
@ 2022-08-15 10:59   ` David Laight
  0 siblings, 0 replies; 22+ messages in thread
From: David Laight @ 2022-08-15 10:59 UTC (permalink / raw)
  To: 'Pavel Machek', Yu-Jen Chang
  Cc: andy, akinobu.mita, jserv, linux-kernel

From: Pavel Machek
> Sent: 12 August 2022 20:07
> 
> Hi!
> 
> > This patche series optimized "memchr()" and add a macro for
> > "memchr_inv()" so that both funtions can use it to generate bit mask.
> >
> > The original implementaion of "memchr()" is based on byte-wise comparison,
> > which do not fully use 64-bit or 32-bit register in CPU. We implement a
> > word-wise comparison so that at least 4 bytes can be compared at the same
> > time. The optimized "memchr()" is nearly 4x faster than the original one
> > for long strings. In Linux Kernel, we find that the length of the string
> 
> Well... how much slower is it for short strings?

And cold cache??

	David

> > searched by "memchr()" is up to 512 bytes in drivers/misc/lkdtm/heap.c.
> > In our test, the optimized version is about 20% faster if the target
> > character is at the end of the string when going through a 512-byte
> > string.
> 
> "What is the average length passed to memchr" would be more useful question.
> 
> Best regards,
> 								Pavel

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-08-15 10:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-10 14:28 [PATCH 0/2] Optimize memchr() Yu-Jen Chang
2022-07-10 14:28 ` [PATCH 1/2] lib/string.c: Add a macro for memchr_inv() Yu-Jen Chang
2022-07-10 14:28 ` [PATCH 2/2] lib/string.c: Optimize memchr() Yu-Jen Chang
2022-07-10 15:16   ` Joe Perches
2022-07-11 14:50     ` Yu-Jen Chang
2022-07-10 16:58   ` kernel test robot
2022-07-10 20:01   ` Andrey Semashev
2022-07-11 14:52     ` Yu-Jen Chang
2022-07-11 15:00       ` Andrey Semashev
2022-07-11 15:03         ` Andy Shevchenko
2022-07-12 14:58         ` Yu-Jen Chang
2022-07-12 15:08           ` Andy Shevchenko
2022-07-13  9:39           ` David Laight
2022-07-13  9:49             ` Andrey Semashev
2022-07-13 10:02               ` Andrey Semashev
2022-07-13 10:24                 ` David Laight
2022-07-22 16:08                   ` Yu-Jen Chang
     [not found]                     ` <CAHp75Vfy6wYqzT-T9aEjVEAQCZ_k=0qN8S8OwG3knbrC-oOkMQ@mail.gmail.com>
2022-07-29 15:42                       ` Yu-Jen Chang
2022-07-13  9:57             ` Andrey Semashev
2022-07-21  5:06   ` kernel test robot
2022-08-12 19:06 ` [PATCH 0/2] " Pavel Machek
2022-08-15 10:59   ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).