* [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
@ 2022-05-11 16:03 Vincent Mailhol
  2022-05-11 16:03 ` [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
                   ` (11 more replies)
  0 siblings, 12 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-11 16:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

Compilers provide builtin functions equivalent to the kernel's ffs(),
__ffs() and ffz() functions. The kernel uses optimized assembly which
produces better code than those builtins. However, such assembly code
cannot be optimized when applied to constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use the kernel's assembly for non-constant expressions

  * use the compiler's builtin functions for constant expressions.

I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v1 -> v2:

  * Use the ORC unwinder for the produced assembly code in patch 1.

  * Rename the functions as follows:
     - __ffs_asm() -> variable_ffs()
     - __ffs_asm_not_zero() -> __variable_ffs()
     - ffz_asm() -> __variable_ffz()

  * Fit #define ffs(x) on a single line.

  * Correct the statistics for ffs() in patch 1 and add the statistics
    for __ffs() and ffz() in patch 2.


Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1



* [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
@ 2022-05-11 16:03 ` Vincent Mailhol
  2022-05-11 20:56   ` Christophe JAILLET
  2022-05-11 21:35   ` Nick Desaulniers
  2022-05-11 16:03 ` [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-11 16:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| 	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| 	return __builtin_ffs(CONST);
| }

GCC would produce the following assembly code:

| 0000000000000000 <foo>:
|    0:	ba 00 00 00 01       	mov    $0x1000000,%edx
|    5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    a:	0f bc c2             	bsf    %edx,%eax
|    d:	83 c0 01             	add    $0x1,%eax
|   10:	c3                   	ret
|   11:	66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
|   18:	00 00 00 00
|   1c:	0f 1f 40 00          	nopl   0x0(%rax)
|
| 0000000000000020 <bar>:
|   20:	b8 19 00 00 00       	mov    $0x19,%eax
|   25:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
|    c:	83 c0 01             	add    $0x1,%eax
|    f:	c3                   	ret
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

In both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.

However, for non-constant expressions, the kernel's asm version of
ffs() remains better for x86_64 because, contrary to GCC, it does not
emit the CMOV assembly instruction, c.f. [1] (notably, clang is able
to optimize out the CMOV call).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes the following -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and were optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

CC: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1



* [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
  2022-05-11 16:03 ` [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-11 16:03 ` Vincent Mailhol
  2022-05-11 22:20   ` Nick Desaulniers
  2022-05-12  0:03 ` [PATCH v3 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-11 16:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, __builtin_ctzl() gets simplified into a single
instruction.

However, for non-constant expressions, the kernel's __ffs() and ffz()
asm versions remain slightly better than the code produced by GCC
(which emits a useless instruction to clear eax).

This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether
the argument is constant or not.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and were optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction is emitted as tzcnt when used
with the rep prefix (which is why we grep tzcnt instead of bsf in the
above benchmark). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

CC: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..7cf5374ce403 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __variable_ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 __variable_ffs(word))
+
+static __always_inline unsigned long __variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 __variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* Re: [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-11 16:03 ` [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-11 20:56   ` Christophe JAILLET
  2022-05-11 23:30     ` Vincent MAILHOL
  2022-05-11 21:35   ` Nick Desaulniers
  1 sibling, 1 reply; 68+ messages in thread
From: Christophe JAILLET @ 2022-05-11 20:56 UTC (permalink / raw)
  To: Vincent Mailhol, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich

Le 11/05/2022 à 18:03, Vincent Mailhol a écrit :
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
> 

[...]

> 
> ** Statistics **
> 
> On a allyesconfig, before applying this patch...:
> 
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 3607
> 
> ...and after:
> 
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 792
> 
> So, roughly 26.7% of the call to ffs() were using constant expression
> and were optimized out.
> 
> 
nitpicking: numbers look odd.
    3607 is the exact same number as in patch 2/2. (ok, could be)
    26.7% is surprising with these numbers. (I guess it is (total_before 
- remaining) / total_before x 100 = (3607-792)/3607 x 100 = 78.0%)

(but patch looks great to me :)

CJ


* Re: [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-11 16:03 ` [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-05-11 20:56   ` Christophe JAILLET
@ 2022-05-11 21:35   ` Nick Desaulniers
  2022-05-11 23:48     ` Vincent MAILHOL
  1 sibling, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-05-11 21:35 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Wed, May 11, 2022 at 9:03 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
> * Example *
>
> Let's consider two dummy functions foo() and bar() as below:
>
> | #include <linux/bitops.h>
> | #define CONST 0x01000000
> |
> | unsigned int foo(void)
> | {
> |       return ffs(CONST);
> | }
> |
> | unsigned int bar(void)
> | {
> |       return __builtin_ffs(CONST);
> | }
>
> GCC would produce below assembly code:

Thanks for the patch! LGTM.
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>

>
> | 0000000000000000 <foo>:
> |    0: ba 00 00 00 01          mov    $0x1000000,%edx
> |    5: b8 ff ff ff ff          mov    $0xffffffff,%eax
> |    a: 0f bc c2                bsf    %edx,%eax
> |    d: 83 c0 01                add    $0x1,%eax
> |   10: c3                      ret

This should be the end of foo.  I...actually don't know what's at the
end here. But I don't think the region from here...

> |   11: 66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> |   18: 00 00 00 00
> |   1c: 0f 1f 40 00             nopl   0x0(%rax)

...to here is relevant.


> |
> | 0000000000000020 <bar>:
> |   20: b8 19 00 00 00          mov    $0x19,%eax
> |   25: c3                      ret
>
> And clang would produce:
>
> | 0000000000000000 <foo>:
> |    0: b8 ff ff ff ff          mov    $0xffffffff,%eax
> |    5: 0f bc 05 00 00 00 00    bsf    0x0(%rip),%eax        # c <foo+0xc>
> |    c: 83 c0 01                add    $0x1,%eax
> |    f: c3                      ret

Weird, so I just tried this:
```
$ cat /tmp/x.c
#define CONST 0x01000000

unsigned ffs (int x) {
  int r;
  asm("bsfl %1,%0"
      : "=r" (r)
      : "rm" (x), "0" (-1));
  return r;
}

unsigned int foo(void) {
  return ffs(CONST);
}

unsigned int bar(void) {
  return __builtin_ffs(CONST);
}
$ clang /tmp/x.c -O2 -o /tmp/x.o -c && llvm-objdump -dr /tmp/x.o
--disassemble-symbols=foo
...
0000000000000010 <foo>:
      10: b8 19 00 00 00                movl    $25, %eax
      15: c3                            retq
      16: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
```
but if we make `ffs` `static`, we get:
```
0000000000000000 <foo>:
       0: b8 ff ff ff ff                movl    $4294967295, %eax
 # imm = 0xFFFFFFFF
       5: 0f bc 05 00 00 00 00          bsfl    (%rip), %eax
 # 0xc <foo+0xc>
                0000000000000008:  R_X86_64_PC32        .LCPI0_0-0x4
       c: c3                            retq
       d: 0f 1f 00                      nopl    (%rax)
```
Which is very interesting to me; it looks like constant propagation
actually hurt optimization: we lost the fact that this was a libcall
which we could have optimized.

As in LLVM does:
1. sink CONST into ffs; it's static and has one caller
2. delete x parameter; it's unused
3. now libcall optimization just sees a call to ffs with no params,
that doesn't match the signature of libc.

Your change should fix that since we don't even call a function named
ffs if we have a constant (explicitly, or via optimization). Filed
https://github.com/llvm/llvm-project/issues/55394
-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-11 16:03 ` [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-05-11 22:20   ` Nick Desaulniers
  2022-05-11 23:23     ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-05-11 22:20 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Wed, May 11, 2022 at 9:04 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x). Because
> __builtin_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
>
> For x86_64, the current __ffs() and ffz() implementations do not
> produce optimized code when called with a constant expression. On the
> contrary, the __builtin_ctzl() gets simplified into a single
> instruction.
>
> However, for non constant expressions, the __ffs() and ffz() asm
> versions of the kernel remains slightly better than the code produced
> by GCC (it produces a useless instruction to clear eax).
>
> This patch uses the __builtin_constant_p() to select between the
> kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
> the argument is constant or not.
>
> ** Statistics **
>
> On a allyesconfig, before applying this patch...:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 3607
>
> ...and after:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 2600
>
> So, roughly 27.9% of the call to either __ffs() or ffz() were using
> constant expression and were optimized out.
>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Note: on x86_64, the asm bsf instruction produces tzcnt when used with
> the rep prefix (which is why we grep tzcnt instead of bsf in above
> benchmark). c.f. [1]
>
> [1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
> http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com
>
> CC: Nick Desaulniers <ndesaulniers@google.com>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>

Patch LGTM, though I find the location of the double underscores in the
names slightly against my taste.

> ---
>  arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 6ed979547086..7cf5374ce403 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
>          ? constant_test_bit((nr), (addr))      \
>          : variable_test_bit((nr), (addr)))
>
> -/**
> - * __ffs - find first set bit in word
> - * @word: The word to search
> - *
> - * Undefined if no bit exists, so code should check against 0 first.
> - */
> -static __always_inline unsigned long __ffs(unsigned long word)
> +static __always_inline unsigned long __variable_ffs(unsigned long word)

How about `variable___ffs`? Patch 1/2 used `variable_ffs` for `ffs`?

>  {
>         asm("rep; bsf %1,%0"
>                 : "=r" (word)
> @@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
>         return word;
>  }
>
> -/**
> - * ffz - find first zero bit in word
> - * @word: The word to search
> - *
> - * Undefined if no zero exists, so code should check against ~0UL first.
> - */
> -static __always_inline unsigned long ffz(unsigned long word)
> +/**
> + * __ffs - find first set bit in word
> + * @word: The word to search
> + *
> + * Undefined if no bit exists, so code should check against 0 first.
> + */
> +#define __ffs(word)                            \
> +       (__builtin_constant_p(word) ?           \
> +        (unsigned long)__builtin_ctzl(word) :  \
> +        __variable_ffs(word))
> +
> +static __always_inline unsigned long __variable_ffz(unsigned long word)

`ffz` had no underscore. Regardless of `__ffs`, this should definitely
be `variable_ffz` IMO.

>  {
>         asm("rep; bsf %1,%0"
>                 : "=r" (word)
> @@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
>         return word;
>  }
>
> +/**
> + * ffz - find first zero bit in word
> + * @word: The word to search
> + *
> + * Undefined if no zero exists, so code should check against ~0UL first.
> + */
> +#define ffz(word)                              \
> +       (__builtin_constant_p(word) ?           \
> +        (unsigned long)__builtin_ctzl(~word) : \
> +        __variable_ffz(word))
> +
>  /*
>   * __fls: find last set bit in word
>   * @word: The word to search
> --
> 2.35.1
>


-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-11 22:20   ` Nick Desaulniers
@ 2022-05-11 23:23     ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-11 23:23 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Thu. 12 May 2022 at 07:20, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Wed, May 11, 2022 at 9:04 AM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> > is equivalent to (unsigned long)__builtin_ctzl(~x). Because
> > __builtin_ctzl() returns an int, a cast to (unsigned long) is
> > necessary to avoid potential warnings on implicit casts.
> >
> > For x86_64, the current __ffs() and ffz() implementations do not
> > produce optimized code when called with a constant expression. On the
> > contrary, the __builtin_ctzl() gets simplified into a single
> > instruction.
> >
> > However, for non constant expressions, the __ffs() and ffz() asm
> > versions of the kernel remains slightly better than the code produced
> > by GCC (it produces a useless instruction to clear eax).
> >
> > This patch uses the __builtin_constant_p() to select between the
> > kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
> > the argument is constant or not.
> >
> > ** Statistics **
> >
> > On a allyesconfig, before applying this patch...:
> >
> > | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> > | 3607
> >
> > ...and after:
> >
> > | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> > | 2600
> >
> > So, roughly 27.9% of the call to either __ffs() or ffz() were using
> > constant expression and were optimized out.
> >
> > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
> >
> > Note: on x86_64, the asm bsf instruction produces tzcnt when used with
> > the rep prefix (which is why we grep tzcnt instead of bsf in above
> > benchmark). c.f. [1]
> >
> > [1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
> > http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com
> >
> > CC: Nick Desaulniers <ndesaulniers@google.com>
> > Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
>
> Patch LGTM, though I find the location of the double unscores in the
> names slightly against my taste.
>
> > ---
> >  arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
> >  1 file changed, 24 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index 6ed979547086..7cf5374ce403 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
> >          ? constant_test_bit((nr), (addr))      \
> >          : variable_test_bit((nr), (addr)))
> >
> > -/**
> > - * __ffs - find first set bit in word
> > - * @word: The word to search
> > - *
> > - * Undefined if no bit exists, so code should check against 0 first.
> > - */
> > -static __always_inline unsigned long __ffs(unsigned long word)
> > +static __always_inline unsigned long __variable_ffs(unsigned long word)
>
> How about `variable___ffs`? Patch 1/2 used `variable_ffs` for `ffs`?

At first glance, having the triple underscores in the middle of the
name seemed odd. On second thought, it is more consistent with the
rest, and I have come to like the idea. Will be adopted in v3.

> >  {
> >         asm("rep; bsf %1,%0"
> >                 : "=r" (word)
> > @@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
> >         return word;
> >  }
> >
> > -/**
> > - * ffz - find first zero bit in word
> > - * @word: The word to search
> > - *
> > - * Undefined if no zero exists, so code should check against ~0UL first.
> > - */
> > -static __always_inline unsigned long ffz(unsigned long word)
> > +/**
> > + * __ffs - find first set bit in word
> > + * @word: The word to search
> > + *
> > + * Undefined if no bit exists, so code should check against 0 first.
> > + */
> > +#define __ffs(word)                            \
> > +       (__builtin_constant_p(word) ?           \
> > +        (unsigned long)__builtin_ctzl(word) :  \
> > +        __variable_ffs(word))
> > +
> > +static __always_inline unsigned long __variable_ffz(unsigned long word)
>
> `ffz` had no underscore. Regardless of `__ffs`, this should definitely
> be `variable_ffz` IMO.

ACK.

> >  {
> >         asm("rep; bsf %1,%0"
> >                 : "=r" (word)
> > @@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
> >         return word;
> >  }
> >
> > +/**
> > + * ffz - find first zero bit in word
> > + * @word: The word to search
> > + *
> > + * Undefined if no zero exists, so code should check against ~0UL first.
> > + */
> > +#define ffz(word)                              \
> > +       (__builtin_constant_p(word) ?           \
> > +        (unsigned long)__builtin_ctzl(~word) : \
> > +        __variable_ffz(word))
> > +
> >  /*
> >   * __fls: find last set bit in word
> >   * @word: The word to search
> > --
> > 2.35.1
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers


* Re: [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-11 20:56   ` Christophe JAILLET
@ 2022-05-11 23:30     ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-11 23:30 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich

On Thu. 12 May 2022 at 05:56, Christophe JAILLET
<christophe.jaillet@wanadoo.fr> wrote:
> Le 11/05/2022 à 18:03, Vincent Mailhol a écrit :
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
>
> [...]
>
> >
> > ** Statistics **
> >
> > On a allyesconfig, before applying this patch...:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 3607
> >
> > ...and after:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 792
> >
> > So, roughly 26.7% of the call to ffs() were using constant expression
> > and were optimized out.
> >
> >
> nitpicking: numbers look odd.

And they are. Thanks for spotting the issue.

>     3607 is the exact same number as in patch 2/2. (ok, could be)
>     26.7% is surprising with these numbers. (I guess it is (total_before
> - remaining) / total_before x 100 = (3607-792)/36.07 = 78.0%)

The 3607 is incorrect (copy/paste issue, sorry). The correct figure is
1081. And:

(1081-792)/1081 = 26.7%

Will amend the comment and send v3 right away.

> (but patch looks great to me :)

Thanks! :)

> CJ


* Re: [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-11 21:35   ` Nick Desaulniers
@ 2022-05-11 23:48     ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-11 23:48 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Thu. 12 May 2022 at 06:35, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Wed, May 11, 2022 at 9:03 AM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
> > * Example *
> >
> > Let's consider two dummy functions foo() and bar() as below:
> >
> > | #include <linux/bitops.h>
> > | #define CONST 0x01000000
> > |
> > | unsigned int foo(void)
> > | {
> > |       return ffs(CONST);
> > | }
> > |
> > | unsigned int bar(void)
> > | {
> > |       return __builtin_ffs(CONST);
> > | }
> >
> > GCC would produce below assembly code:
>
> Thanks for the patch! LGTM.
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
>
> >
> > | 0000000000000000 <foo>:
> > |    0: ba 00 00 00 01          mov    $0x1000000,%edx
> > |    5: b8 ff ff ff ff          mov    $0xffffffff,%eax
> > |    a: 0f bc c2                bsf    %edx,%eax
> > |    d: 83 c0 01                add    $0x1,%eax
> > |   10: c3                      ret
>
> This should be the end of foo.  I...actually don't know what's at the
> end here. But I don't think the region from here...
>
> > |   11: 66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> > |   18: 00 00 00 00
> > |   1c: 0f 1f 40 00             nopl   0x0(%rax)
>
> ...to here is relevant.

I do not know either. I hesitated to trim this part but finally sent
it verbatim.

I will trim it in v3.

> > |
> > | 0000000000000020 <bar>:
> > |   20: b8 19 00 00 00          mov    $0x19,%eax
> > |   25: c3                      ret
> >
> > And clang would produce:
> >
> > | 0000000000000000 <foo>:
> > |    0: b8 ff ff ff ff          mov    $0xffffffff,%eax
> > |    5: 0f bc 05 00 00 00 00    bsf    0x0(%rip),%eax        # c <foo+0xc>
> > |    c: 83 c0 01                add    $0x1,%eax
> > |    f: c3                      ret
>
> Weird, so I just tried this:
> ```
> $ cat /tmp/x.c
> #define CONST 0x01000000
>
> unsigned ffs (int x) {
>   int r;
>   asm("bsfl %1,%0"
>       : "=r" (r)
>       : "rm" (x), "0" (-1));
>   return r;
> }
>
> unsigned int foo(void) {
>   return ffs(CONST);
> }
>
> unsigned int bar(void) {
>   return __builtin_ffs(CONST);
> }
> $ clang /tmp/x.c -O2 -o /tmp/x.o -c && llvm-objdump -dr /tmp/x.o
> --disassemble-symbols=foo
> ...
> 0000000000000010 <foo>:
>       10: b8 19 00 00 00                movl    $25, %eax
>       15: c3                            retq
>       16: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
> ```
> but if we make `ffs` `static`, we get:
> ```
> 0000000000000000 <foo>:
>        0: b8 ff ff ff ff                movl    $4294967295, %eax
>  # imm = 0xFFFFFFFF
>        5: 0f bc 05 00 00 00 00          bsfl    (%rip), %eax
>  # 0xc <foo+0xc>
>                 0000000000000008:  R_X86_64_PC32        .LCPI0_0-0x4
>        c: c3                            retq
>        d: 0f 1f 00                      nopl    (%rax)
> ```
> Which is very interesting to me; it looks like constant propagation
> actually hurt optimization, we lost that this was a libcall which we
> could have optimized.
>
> As in LLVM does:
> 1. sink CONST into ffs; it's static and has one caller
> 2. delete x parameter; it's unused
> 3. now libcall optimization just sees a call to ffs with no params,
> that doesn't match the signature of libc.
>
> Your change should fix that since we don't even call a function named
> ffs if we have a constant (explicitly, or via optimization). Filed
> https://github.com/llvm/llvm-project/issues/55394

Great! Didn't realize my patch had so many side benefits.
Will add a one-sentence remark in v3 and point to your message.

Thanks!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v3 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
  2022-05-11 16:03 ` [PATCH v2 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-05-11 16:03 ` [PATCH v2 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-05-12  0:03 ` Vincent Mailhol
  2022-05-12  0:03   ` [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-05-12  0:03   ` [PATCH v3 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-05-12  1:18 ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  0:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be optimized when used
on constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use the kernel's assembly for non-constant expressions

  * use the compiler's __builtin function for constant expressions.
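The selection can be sketched in userspace as below (a minimal illustration, not the kernel code: the *_demo names are made up here, and libc's ffs() stands in for the kernel's asm helper):

```c
#include <strings.h>	/* libc ffs(), standing in for the kernel asm helper */

/* Stand-in for the kernel's asm-based helper (hypothetical name). */
static inline int variable_ffs_demo(int x)
{
	return ffs(x);
}

/*
 * Constant expressions take the __builtin_ffs() branch, which the
 * compiler folds into a single immediate; anything else falls back
 * to the helper.
 */
#define ffs_demo(x) \
	(__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs_demo(x))
```

With -O2, a call such as ffs_demo(0x01000000) should fold to a single mov of the result, much like bar() in the patch descriptions.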

I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial, so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in the statistics of patch 1. The number of
    occurrences before the patches is 1081, not 3607 (the percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follows:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

v1 -> v2:

  * Use the ORC unwinder for the produced assembly code in patch 1.

  * Rename the functions as follows:
     - __ffs_asm() -> variable_ffs()
     - __ffs_asm_not_zero() -> __variable_ffs()
     - ffz_asm() -> variable_ffz()

  * Fit #define ffs(x) on a single line.

  * Correct the statistics for ffs() in patch 1 and add the statistics
    for __ffs() and ffz() in patch 2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  0:03 ` [PATCH v3 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-05-12  0:03   ` Vincent Mailhol
  2022-05-12  0:28     ` Nick Desaulniers
  2022-05-12  0:03   ` [PATCH v3 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  0:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| 	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| 	return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
|    0:	ba 00 00 00 01       	mov    $0x1000000,%edx
|    5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    a:	0f bc c2             	bsf    %edx,%eax
|    d:	83 c0 01             	add    $0x1,%eax
|   10:	c3                   	ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
|   20:	b8 19 00 00 00       	mov    $0x19,%eax
|   25:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
|    c:	83 c0 01             	add    $0x1,%eax
|    f:	c3                   	ret
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

For both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.

However, for non-constant expressions, the kernel's asm version of
ffs() remains better for x86_64 because, contrary to GCC, it does not
emit the CMOV assembly instruction, c.f. [1] (notably, clang is able
to optimize out the CMOV).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

And finally, Nick Desaulniers pointed out in [2] that this also fixes
a constant propagation missed-optimization in clang.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

[2] https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v3 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-12  0:03 ` [PATCH v3 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-05-12  0:03   ` [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-12  0:03   ` Vincent Mailhol
  2022-05-12  0:19     ` Nick Desaulniers
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  0:03 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
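As a quick sanity check, the two identities can be exercised in userspace (the demo_* helpers below are hypothetical names for this sketch, not the kernel macros):

```c
/* __ffs(): index of the lowest set bit; undefined for word == 0. */
static unsigned long demo___ffs(unsigned long word)
{
	return (unsigned long)__builtin_ctzl(word);
}

/* ffz(): index of the lowest clear bit; undefined for word == ~0UL. */
static unsigned long demo_ffz(unsigned long word)
{
	return (unsigned long)__builtin_ctzl(~word);
}
```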

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless extra instruction to clear eax).

This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether
the argument is constant or not.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep tzcnt instead of bsf in the above
benchmark), c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

CC: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..fb0d7cd9f957 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v3 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-12  0:03   ` [PATCH v3 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-05-12  0:19     ` Nick Desaulniers
  0 siblings, 0 replies; 68+ messages in thread
From: Nick Desaulniers @ 2022-05-12  0:19 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Wed, May 11, 2022 at 5:04 PM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x). Because
> __builtin_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
>
> For x86_64, the current __ffs() and ffz() implementations do not
> produce optimized code when called with a constant expression. On the
> contrary, the __builtin_ctzl() gets simplified into a single
> instruction.
>
> However, for non-constant expressions, the __ffs() and ffz() asm
> versions of the kernel remain slightly better than the code produced
> by GCC (which emits a useless extra instruction to clear eax).
>
> This patch uses the __builtin_constant_p() to select between the
> kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
> the argument is constant or not.
>
> ** Statistics **
>
> On an allyesconfig, before applying this patch...:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 3607
>
> ...and after:
>
> | $ objdump -d vmlinux.o | grep tzcnt | wc -l
> | 2600
>
> So, roughly 27.9% of the calls to either __ffs() or ffz() were using
> constant expressions and could be optimized out.
>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Note: on x86_64, the asm bsf instruction produces tzcnt when used with
> the rep prefix (which is why we grep tzcnt instead of bsf in the above
> benchmark), c.f. [1]
>
> [1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
> http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com
>
> CC: Nick Desaulniers <ndesaulniers@google.com>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>

Thanks for the patches!
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>


-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  0:03   ` [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-12  0:28     ` Nick Desaulniers
  2022-05-12  1:18       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-05-12  0:28 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Wed, May 11, 2022 at 5:04 PM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> And finally, Nick Desaulniers pointed out in [2] that this also fixes
> a constant propagation missed-optimization in clang.
>
> [2] https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

Regarding
https://github.com/llvm/llvm-project/issues/55394
it seems that functions with static linkage cannot be considered
library functions, so libcall optimization will not run on calls to
them. So the compiler might be able to do a better job for constants
if ffs() and friends indeed were not defined in a header as static
inline.  But that relies on the compiler knowing these tricks; I think
the kernel's approach is just fine (better in fact, because we should
inline these tiny functions, regardless of LTO), but like this series
shows, there may be room for improvement for other functions within
the kernel that are defined as static inline in headers that are
normally found in a libc.

So I no longer think there's a missed optimization here, but at this
point, it's not worth a respin of the series IMO to just let sleeping
dogs lie.

Unless the x86 maintainers wouldn't mind dropping that line and link
when applying?
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v3 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  0:28     ` Nick Desaulniers
@ 2022-05-12  1:18       ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-12  1:18 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich

On Thu. 12 May 2022 at 09:28, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Wed, May 11, 2022 at 5:04 PM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > And finally, Nick Desaulniers pointed out in [2] that this also fixes
> > a constant propagation missed-optimization in clang.
> >
> > [2] https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
> Regarding
> https://github.com/llvm/llvm-project/issues/55394
> it seems that functions with static linkage cannot be considered
> library functions, so libcall optimization will not run on calls to
> them. So the compiler might be able to do a better job for constants
> if ffs() and friends indeed were not defined in a header as static
> inline.  But that relies on the compiler knowing these tricks; I think
> the kernel's approach is just fine (better in fact, because we should
> inline these tiny functions, regardless of LTO), but like this series
> shows, there may be room for improvement for other functions within
> the kernel that are defined as static inline in headers that are
> normally found in a libc.
>
> So I no longer think there's a missed optimization here, but at this
> point, it's not worth a respin of the series IMO to just let sleeping
> dogs lie.
>
> Unless the x86 maintainers wouldn't mind dropping that line and link
> when applying?

Let me send the v4; this will save the x86 maintainers some manual
editing effort (and I will add your "Reviewed-by" tag in patch 2 while
doing so).

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (2 preceding siblings ...)
  2022-05-12  0:03 ` [PATCH v3 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-05-12  1:18 ` Vincent Mailhol
  2022-05-12  1:18   ` [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
                     ` (2 more replies)
  2022-06-25  7:26 ` [RESEND PATCH " Vincent Mailhol
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  1:18 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be optimized when used
on constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use the kernel's assembly for non-constant expressions

  * use the compiler's __builtin function for constant expressions.

I also think that fls() and fls64() can be optimized in a similar
way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial, so I want to focus on this series first. If it gets accepted, I
will then work on those two additional functions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v3 -> v4:

  * (no code changes, only the commit messages were modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in the statistics of patch 1/2. The number of
    occurrences before the patches is 1081, not 3607 (the percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follows:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in patch 1/2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  1:18 ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-05-12  1:18   ` Vincent Mailhol
  2022-05-12  3:02     ` Joe Perches
  2022-05-12  1:18   ` [PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-05-23  9:22   ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent MAILHOL
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  1:18 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| 	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| 	return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
|    0:	ba 00 00 00 01       	mov    $0x1000000,%edx
|    5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    a:	0f bc c2             	bsf    %edx,%eax
|    d:	83 c0 01             	add    $0x1,%eax
|   10:	c3                   	ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
|   20:	b8 19 00 00 00       	mov    $0x19,%eax
|   25:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
|    c:	83 c0 01             	add    $0x1,%eax
|    f:	c3                   	ret
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

For both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant
expressions.

However, for non-constant expressions, the kernel's asm version of
ffs() remains better for x86_64 because, contrary to GCC, it does not
emit the CMOV assembly instruction, c.f. [1] (notably, clang is able
to optimize out the CMOV).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-12  1:18 ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-05-12  1:18   ` [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-12  1:18   ` Vincent Mailhol
  2022-05-23  9:22   ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent MAILHOL
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-05-12  1:18 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless extra instruction to clear eax).

This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether
the argument is constant or not.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep for tzcnt instead of bsf in the
above benchmark), cf. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* Re: [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  1:18   ` [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-12  3:02     ` Joe Perches
  2022-05-12  4:29       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Joe Perches @ 2022-05-12  3:02 UTC (permalink / raw)
  To: Vincent Mailhol, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET

On Thu, 2022-05-12 at 10:18 +0900, Vincent Mailhol wrote:
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
[]
> -static __always_inline int ffs(int x)
> +static __always_inline int variable_ffs(int x)
>  {
>  	int r;
>  
> @@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
>  	return r + 1;
>  }
>  
> +/**
> + * ffs - find first set bit in word
> + * @x: the word to search
> + *
> + * This is defined the same way as the libc and compiler builtin ffs
> + * routines, therefore differs in spirit from the other bitops.
> + *
> + * ffs(value) returns 0 if value is 0 or the position of the first
> + * set bit if value is nonzero. The first (least significant) bit
> + * is at position 1.
> + */
> +#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))

How about not defining another function and using parentheses around
the function definition to avoid the macro expansion like:

#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : ffs(x))

and

static __always_inline int (ffs)(int x)
{
	etc...
}




* Re: [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-12  3:02     ` Joe Perches
@ 2022-05-12  4:29       ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-12  4:29 UTC (permalink / raw)
  To: Joe Perches
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET

On Thu. 12 May 2022 at 12:02, Joe Perches <joe@perches.com> wrote:
> On Thu, 2022-05-12 at 10:18 +0900, Vincent Mailhol wrote:
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> []
> > -static __always_inline int ffs(int x)
> > +static __always_inline int variable_ffs(int x)
> >  {
> >       int r;
> >
> > @@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
> >       return r + 1;
> >  }
> >
> > +/**
> > + * ffs - find first set bit in word
> > + * @x: the word to search
> > + *
> > + * This is defined the same way as the libc and compiler builtin ffs
> > + * routines, therefore differs in spirit from the other bitops.
> > + *
> > + * ffs(value) returns 0 if value is 0 or the position of the first
> > + * set bit if value is nonzero. The first (least significant) bit
> > + * is at position 1.
> > + */
> > +#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
>
> How about not defining another function and using parentheses around
> the function definition to avoid the macro expansion like:
>
> #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : ffs(x))
>
> and
>
> static __always_inline int (ffs)(int x)
> {
>         etc...
> }

Sorry, but I don’t really like this approach.

The main issue I see is that this code will emit a -Wshadow warning.

And using parentheses around the function definition just seems like an
obscure hack to me. The variable_foo() naming gives me less of a
headache. Was this pattern ever used anywhere else in the kernel?


Yours sincerely,
Vincent Mailhol


* Re: [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-12  1:18 ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-05-12  1:18   ` [PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-05-12  1:18   ` [PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-05-23  9:22   ` Vincent MAILHOL
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-05-23  9:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Nick Desaulniers

On Thu. 12 May 2022 at 10:20, Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
> The compilers provide builtin expressions equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code cannot be optimized when used
> on constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
>   * use kernel assembly for non constant expressions
>
>   * use compiler's __builtin function for constant expressions.
>
> I also think that fls() and fls64() can be optimized in a similar
> way, using __builtin_clz() and __builtin_clzll(), but it is a bit less
> trivial so I want to focus on this series first. If it gets accepted, I
> will then work on those two additional functions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v3 -> v4:
>
>   * (no code changes; only the commit message was modified)
>
>   * Remove note and link to Nick's message in patch 1/2, c.f.:
>   https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 2/2.
>
>
> v2 -> v3:
>
>   * Redacted out the instructions after ret and before next function
>     in the assembly output.
>
>   * Added a note and a link to Nick's message on the constant
>     propagation missed-optimization in clang:
>     https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
>   * Fix copy/paste typo in statistics of patch 1/2. Number of
>     occurrences before patches are 1081 and not 3607 (percentage
>     reduction of 26.7% remains correct)
>
>   * Rename the functions as follows:
>     - __variable_ffs() -> variable___ffs()
>     - __variable_ffz() -> variable_ffz()
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 1/2.
>
> Vincent Mailhol (2):
>   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
>     expressions
>   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
>     expressions

Hi Thomas, Ingo and Borislav,

Is there any chance you could pick up these two patches during this
week's merge window?
https://lore.kernel.org/all/20220512011855.1189653-2-mailhol.vincent@wanadoo.fr/
https://lore.kernel.org/all/20220512011855.1189653-3-mailhol.vincent@wanadoo.fr/

Thank you!


Yours sincerely,
Vincent Mailhol


* [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (3 preceding siblings ...)
  2022-05-12  1:18 ` [PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-06-25  7:26 ` Vincent Mailhol
  2022-06-25  7:26   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-06-25  7:26   ` [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-07-23 15:15 ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-06-25  7:26 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Joe Perches, Josh Poimboeuf, Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be optimized when used
on constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v3 -> v4:

  * (no code changes; only the commit message was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. Number of
    occurrences before patches are 1081 and not 3607 (percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follows:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 1/2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1



* [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-06-25  7:26 ` [RESEND PATCH " Vincent Mailhol
@ 2022-06-25  7:26   ` Vincent Mailhol
  2022-06-25  7:26   ` [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-06-25  7:26 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Joe Perches, Josh Poimboeuf, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| 	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| 	return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
|    0:	ba 00 00 00 01       	mov    $0x1000000,%edx
|    5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    a:	0f bc c2             	bsf    %edx,%eax
|    d:	83 c0 01             	add    $0x1,%eax
|   10:	c3                   	ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
|   20:	b8 19 00 00 00       	mov    $0x19,%eax
|   25:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
|    c:	83 c0 01             	add    $0x1,%eax
|    f:	c3                   	ret
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

In both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non-constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV call).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1



* [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-06-25  7:26 ` [RESEND PATCH " Vincent Mailhol
  2022-06-25  7:26   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-06-25  7:26   ` Vincent Mailhol
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-06-25  7:26 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe JAILLET, Joe Perches, Josh Poimboeuf, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless instruction to clear eax).

This patch uses __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether
the argument is constant or not.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why we grep for tzcnt instead of bsf in the
above benchmark), cf. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (4 preceding siblings ...)
  2022-06-25  7:26 ` [RESEND PATCH " Vincent Mailhol
@ 2022-07-23 15:15 ` Vincent Mailhol
  2022-07-23 15:15   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
                     ` (2 more replies)
  2022-08-12 11:44 ` [PATCH v5 " Vincent Mailhol
                   ` (5 subsequent siblings)
  11 siblings, 3 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-07-23 15:15 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be optimized when used
on constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v3 -> v4:

  * (no code changes; only the commit message was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. Number of
    occurrences before patches are 1081 and not 3607 (percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follows:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> in tag in patch 1/2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1



* [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-07-23 15:15 ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-07-23 15:15   ` Vincent Mailhol
  2022-08-11 14:59     ` Borislav Petkov
  2022-07-23 15:15   ` [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-07-29 11:24   ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent MAILHOL
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-07-23 15:15 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
| 	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
| 	return __builtin_ffs(CONST);
| }

GCC would produce below assembly code:

| 0000000000000000 <foo>:
|    0:	ba 00 00 00 01       	mov    $0x1000000,%edx
|    5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    a:	0f bc c2             	bsf    %edx,%eax
|    d:	83 c0 01             	add    $0x1,%eax
|   10:	c3                   	ret
<Instructions after ret and before next function were redacted>
|
| 0000000000000020 <bar>:
|   20:	b8 19 00 00 00       	mov    $0x19,%eax
|   25:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
|    c:	83 c0 01             	add    $0x1,%eax
|    f:	c3                   	ret
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

In both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non-constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV call).

This patch uses __builtin_constant_p() to select between the
kernel's ffs() and __builtin_ffs() depending on whether the
argument is constant or not.

As a side benefit, this patch also removes below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

...and after:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1



* [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-07-23 15:15 ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-07-23 15:15   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-07-23 15:15   ` Vincent Mailhol
  2022-07-29 11:24   ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent MAILHOL
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-07-23 15:15 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra
  Cc: Dave Hansen, H . Peter Anvin, Nathan Chancellor, Tom Rix,
	linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless extra instruction to clear eax).

This patch uses the __builtin_constant_p() to select between the
kernel's __ffs()/ffz() and the __builtin_ctzl() depending on whether
the argument is constant or not.

** Statistics **

On an allyesconfig, before applying this patch...:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 3607

...and after:

| $ objdump -d vmlinux.o | grep tzcnt | wc -l
| 2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which is why the benchmark above greps for tzcnt
instead of bsf). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..f88c55b8b37c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* Re: [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-07-23 15:15 ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-07-23 15:15   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-07-23 15:15   ` [RESEND PATCH v4 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-07-29 11:24   ` Vincent MAILHOL
  2022-07-29 12:22     ` Borislav Petkov
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-07-29 11:24 UTC (permalink / raw)
  To: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Peter Zijlstra, Dave Hansen
  Cc: H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm,
	David Howells, Jan Beulich, Christophe Jaillet, Joe Perches,
	Josh Poimboeuf

On Sun 24 Jul. 2022 at 00:15, Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
> The compilers provide some builtin expressions equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code cannot be optimized when used
> on constant expressions.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
>   * use kernel assembly for non constant expressions
>
>   * use compiler's __builtin function for constant expressions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v3 -> v4:
>
>   * (no changes on code, only commit comment was modified)
>
>   * Remove note and link to Nick's message in patch 1/2, c.f.:
>   https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in patch 2/2.
>
>
> v2 -> v3:
>
>   * Redacted out the instructions after ret and before next function
>     in the assembly output.
>
>   * Added a note and a link to Nick's message on the constant
>     propagation missed-optimization in clang:
>     https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
>   * Fix copy/paste typo in statistics of patch 1/2. Number of
>     occurrences before the patches is 1081, not 3607 (the percentage
>     reduction of 26.7% remains correct)
>
>   * Rename the functions as follow:
>     - __variable_ffs() -> variable___ffs()
>     - __variable_ffz() -> variable_ffz()
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in patch 1/2.
>
> Vincent Mailhol (2):
>   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
>     expressions
>   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
>     expressions
>
>  arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
>  1 file changed, 38 insertions(+), 26 deletions(-)

Hi Thomas, Ingo, Borislav, Dave and Peter,

This patch series [1] has been waiting for more than two months
already. So far, I have not heard back from any of the x86 maintainers.
Do you see anything wrong with this series? If not, any chance one of
you could pick it up?

[1] https://lore.kernel.org/all/20220625072645.251828-1-mailhol.vincent@wanadoo.fr/#t

Thank you,


Yours sincerely,
Vincent Mailhol


* Re: [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-07-29 11:24   ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent MAILHOL
@ 2022-07-29 12:22     ` Borislav Petkov
  2022-07-29 13:50       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-07-29 12:22 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Fri, Jul 29, 2022 at 08:24:58PM +0900, Vincent MAILHOL wrote:
> This patch series [1] has been waiting for more than two months
> already. So far, I have not heard back from any of the x86 maintainers.
> Do you see anything wrong with this series? If not, any chance one of
> you could pick it up?

They're on my todo list but you have to be patient. If you haven't
heard, we're still busy with this thing called retbleed.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-07-29 12:22     ` Borislav Petkov
@ 2022-07-29 13:50       ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-07-29 13:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Fri. 29 Jul. 2022 at 21:22, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 29, 2022 at 08:24:58PM +0900, Vincent MAILHOL wrote:
> > This patch series [1] has been waiting for more than two months
> > already. So far, I have not heard back from any of the x86 maintainers.
> > Do you see anything wrong with this series? If not, any chance one of
> > you could pick it up?
>
> They're on my todo list but you have to be patient. If you haven't
> heard, we're still busy with this thing called retbleed.

Understood and thanks for the update!
If this is on your radar, then no more worries on my side. I wish you
courage and good luck for the retbleed fix!


Yours sincerely,
Vincent Mailhol


* Re: [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-07-23 15:15   ` [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-08-11 14:59     ` Borislav Petkov
  2022-08-12 11:55       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-08-11 14:59 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Sun, Jul 24, 2022 at 12:15:20AM +0900, Vincent Mailhol wrote:
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
> 
> * Example *
> 
> Let's consider two dummy functions foo() and bar() as below:
> 
> | #include <linux/bitops.h>
> | #define CONST 0x01000000

Those code examples you can simply indent with two spaces.

> In both examples, we clearly see the benefit of using __builtin_ffs()

Who's "we"?

Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.

> instead of the kernel's asm implementation for constant expressions.
> 
> However, for non constant expressions, the ffs() asm version of the
> kernel remains better for x86_64 because, contrary to GCC, it doesn't
> emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
> able to optimize out the CMOV call).
> 
> This patch uses the __builtin_constant_p() to select between the

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> kernel's ffs() and the __builtin_ffs() depending on whether the
> argument is constant or not.

In general, you don't have to say what the patch does - that should be
visible from the diff. The more important part is the *why*. And that
you do.

Rest looks ok.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [PATCH v5 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (5 preceding siblings ...)
  2022-07-23 15:15 ` [RESEND PATCH v4 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-08-12 11:44 ` Vincent Mailhol
  2022-08-12 11:44   ` [PATCH v5 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-08-12 11:44   ` [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-08-31  7:57 ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-12 11:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

The compilers provide some builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be folded when used
with constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v4 -> v5:

  * (no changes on code, only commit comment was modified)

  * Rewrite the commit log:
    - Use two spaces instead of `| ' to indent code snippets.
    - Do not use `we'.
    - Do not use `this patch' in the commit description. Instead,
      use imperative tone.
  Link: https://lore.kernel.org/all/YvUZVYxbOMcZtR5G@zn.tnic/


v3 -> v4:

  * (no changes on code, only commit comment was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. Number of
    occurrences before the patches is 1081, not 3607 (the percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follow:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 1/2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1



* [PATCH v5 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-08-12 11:44 ` [PATCH v5 " Vincent Mailhol
@ 2022-08-12 11:44   ` Vincent Mailhol
  2022-08-12 11:44   ` [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-12 11:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to fold the expression into a single instruction.

** Example **

Let's consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
  	return ffs(CONST);
  }

  unsigned int bar(void)
  {
  	return __builtin_ffs(CONST);
  }

GCC would produce below assembly code:

  0000000000000000 <foo>:
     0:	ba 00 00 00 01       	mov    $0x1000000,%edx
     5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     a:	0f bc c2             	bsf    %edx,%eax
     d:	83 c0 01             	add    $0x1,%eax
    10:	c3                   	ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20:	b8 19 00 00 00       	mov    $0x19,%eax
    25:	c3                   	ret

And clang would produce:

  0000000000000000 <foo>:
     0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
     c:	83 c0 01             	add    $0x1,%eax
     f:	c3                   	ret

  0000000000000010 <bar>:
    10:	b8 19 00 00 00       	mov    $0x19,%eax
    15:	c3                   	ret

Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able to optimize out the CMOV call).

Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.

As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1



* [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-12 11:44 ` [PATCH v5 " Vincent Mailhol
  2022-08-12 11:44   ` [PATCH v5 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-08-12 11:44   ` Vincent Mailhol
  2022-08-23 16:23     ` Borislav Petkov
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-12 11:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless extra instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which explains the use of `grep tzcnt' instead of `grep
bsf' in the above benchmark). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..bd49aef87ab6 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* Re: [RESEND PATCH v4 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-08-11 14:59     ` Borislav Petkov
@ 2022-08-12 11:55       ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-08-12 11:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

Hi Borislav,

On Thu. 11 Aug 2022 at 23:59, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Jul 24, 2022 at 12:15:20AM +0900, Vincent Mailhol wrote:
> > For x86_64, the current ffs() implementation does not produce
> > optimized code when called with a constant expression. On the
> > contrary, the __builtin_ffs() function of both GCC and clang is able
> > to simplify the expression into a single instruction.
> >
> > * Example *
> >
> > Let's consider two dummy functions foo() and bar() as below:
> >
> > | #include <linux/bitops.h>
> > | #define CONST 0x01000000
>
> Those code examples you can simply indent with two spaces.
>
> > In both examples, we clearly see the benefit of using __builtin_ffs()
>
> Who's "we"?
>
> Please use passive voice in your commit message: no "we" or "I", etc,
> and describe your changes in imperative mood.
>
> > instead of the kernel's asm implementation for constant expressions.
> >
> > However, for non constant expressions, the ffs() asm version of the
> > kernel remains better for x86_64 because, contrary to GCC, it doesn't
> > emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
> > able to optimize out the CMOV call).
> >
> > This patch uses the __builtin_constant_p() to select between the
>
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
>
> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.
>
> > kernel's ffs() and the __builtin_ffs() depending on whether the
> > argument is constant or not.
>
> In general, you don't have to say what the patch does - that should be
> visible from the diff. The more important part is the *why*. And that
> you do.
>
> Rest looks ok.

Thank you for the review!
I addressed all your comments and sent a v5:
https://lore.kernel.org/all/20220812114438.1574-1-mailhol.vincent@wanadoo.fr/


Yours sincerely,
Vincent Mailhol


* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-12 11:44   ` [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-08-23 16:23     ` Borislav Petkov
  2022-08-23 17:12       ` Nick Desaulniers
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-08-23 16:23 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)

Are you sure about this?

My gcc documentation says:

"Built-in Function: int __builtin_ctz (unsigned int x)

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."

Note the undefined part.

Also,

__builtin_ctzl(0): 0x40
ffs(0): 0x0

I'm using the kernel ffs() version in a small program which is basically
a wrapper around BSF.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-23 16:23     ` Borislav Petkov
@ 2022-08-23 17:12       ` Nick Desaulniers
  2022-08-23 17:43         ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-08-23 17:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vincent Mailhol, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Tue, Aug 23, 2022 at 9:23 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> > __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)
>
> Are you sure about this?
>
> My gcc documentation says:
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
>     Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> Note the undefined part.
>
> Also,
>
> __builtin_ctzl(0): 0x40
> ffs(0): 0x0
>
> I'm using the kernel ffs() version in a small program which is basically
> a wrapper around BSF.

Callers of these need to guard against zero input, as the pre-existing
comment notes:

>> Undefined if no bit exists, so code should check against 0 first.
-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-23 17:12       ` Nick Desaulniers
@ 2022-08-23 17:43         ` Borislav Petkov
  2022-08-23 20:31           ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-08-23 17:43 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Vincent Mailhol, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> Callers of these need to guard against zero input, as the pre-existing
> comment notes:
> 
> >> Undefined if no bit exists, so code should check against 0 first.

This is just silly.

And then there's

 * ffs(value) returns 0 if value is 0 or the position of the first
 * set bit if value is nonzero. The first (least significant) bit
 * is at position 1.
 */
static __always_inline int ffs(int x)

Can we unify the two and move the guard against 0 inside the function
or, like ffs() does, have it return 0 if value is 0?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-23 17:43         ` Borislav Petkov
@ 2022-08-23 20:31           ` Vincent MAILHOL
  2022-08-24  8:43             ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-08-23 20:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed. 24 Aug. 2022 at 02:43, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> > Callers of these need to guard against zero input, as the pre-existing
> > comment notes:
> >
> > >> Undefined if no bit exists, so code should check against 0 first.

If the fact that __ffs(0) is undefined is a concern, I can add a safety net:

  #define __ffs(word)                                    \
          (__builtin_constant_p(word) ?                  \
                  (unsigned long)__builtin_ctzl(word) +  \
                          BUILD_BUG_ON_ZERO(!(word)) :   \
                  variable___ffs(word))

It will only catch constant expressions but that is still better than
nothing (this comment also applies to the other functions which are
undefined when the argument is zero).

Also, if this aspect was unclear, I can add a comment in the commit
log to explain.

> This is just silly.
>
> And then there's
>
>  * ffs(value) returns 0 if value is 0 or the position of the first
>  * set bit if value is nonzero. The first (least significant) bit
>  * is at position 1.
>  */
> static __always_inline int ffs(int x)
>
> Can we unify the two and move the guard against 0 inside the function
> or, like ffs() does, have it return 0 if value is 0?

There is an index issue. __ffs() starts at 0 but ffs() starts at 1,
i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
explains why __ffs(0) is undefined.


Yours sincerely,
Vincent Mailhol

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-23 20:31           ` Vincent MAILHOL
@ 2022-08-24  8:43             ` Borislav Petkov
  2022-08-24 12:10               ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-08-24  8:43 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> If the fact that __ffs(0) is undefined is a concern,

So what is of concern is that I'm looking at those *ffs things and they
look like a real mess:

 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
        asm("rep; bsf %1,%0"

and that's TZCNT.

And nowhere in TZCNT's description does it talk about undefined behavior
- it is all defined.

So I have no clue what that comment is supposed to mean?

Then:

 * ffs - find first set bit in word
 * @x: the word to search
 *
 * This is defined the same way as the libc and compiler builtin ffs
 * routines, therefore differs in spirit from the other bitops.
 *
 * ffs(value) returns 0 if value is 0 or the position of the first
 * set bit if value is nonzero. The first (least significant) bit
 * is at position 1.

while

"Built-in Function: int __builtin_ctz (unsigned int x)

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."

as previously pasted.

So ffs() doesn't have undefined behavior either.

I guess it wants to say, it is undefined in the *respective* libc or
compiler helper implementation. And that should be explained.

> I can add a safety net:

Nah, no need. It seems this "behavior" has been the case a long time so
callers should know better (or have burned themselves properly :)).

> There is an index issue. __ffs() starts at 0 but ffs() starts at one.
> i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> explains why __ffs(0) is undefined.

I'd love to drop the undefined thing and start counting at 1 while
keeping the 0 case the special one.

But that ship has sailed a long time ago - look at all the __ffs() and
ffs() callers.

Back to your patch: I think the text should be fixed to say that both
ffs() and __ffs()'s kernel implementation doesn't have undefined results
but since it needs to adhere to the libc variants' API, it treats 0
differently. They surely can handle 0 as input.

I.e., I'd like to see a comment there explaining the whole difference
between ffs() and __ffs() so that people are aware.

Btw, pls do

s/variable___ffs/variable__ffs/g

Two underscores are just fine.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-24  8:43             ` Borislav Petkov
@ 2022-08-24 12:10               ` Vincent MAILHOL
  2022-08-24 13:24                 ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-08-24 12:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed. 24 Aug 2022 at 17:43, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> > If the fact that __ffs(0) is undefined is a concern,
>
> So what is of concern is I'm looking at those *ffs things and they look
> like a real mess:

I agree that the thing is a mess. Especially the naming: adding
underscores when the behaviour is different is misleading. I think
that ctzl() would have been a better name than __ffs().

>  * Undefined if no bit exists, so code should check against 0 first.
>  */
> static __always_inline unsigned long __ffs(unsigned long word)
> {
>         asm("rep; bsf %1,%0"
>
> and that's TZCNT.

Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…

> And nowhere in TZCNT's description does it talk about undefined behavior
> - it is all defined.
>
> So I have no clue what that comment is supposed to mean?

It means that __ffs() is not an x86_64-specific function. Each
architecture is free to provide an optimized implementation and is
free to ignore __ffs(0) because this is undefined.
For __ffs(0) to be defined, every architecture would have to produce
the same result, and this is not the case.

> Then:
>
>  * ffs - find first set bit in word
>  * @x: the word to search
>  *
>  * This is defined the same way as the libc and compiler builtin ffs
>  * routines, therefore differs in spirit from the other bitops.
>  *
>  * ffs(value) returns 0 if value is 0 or the position of the first
>  * set bit if value is nonzero. The first (least significant) bit
>  * is at position 1.
>
> while
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
>     Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> as previously pasted.
>
> So ffs() doesn't have undefined behavior either.
>
> I guess it wants to say, it is undefined in the *respective* libc or
> compiler helper implementation. And that should be explained.
>
> > I can add a safety net:
>
> Nah, no need. It seems this "behavior" has been the case a long time so
> callers should know better (or have burned themselves properly :)).
>
> > There is an index issue. __ffs() starts at 0 but ffs() starts at one.
> > i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> > Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> > explains why __ffs(0) is undefined.
>
> I'd love to drop the undefined thing and start counting at 1 while
> keeping the 0 case the special one.
>
> But that ship has sailed a long time ago - look at all the __ffs() and
> ffs() callers.

ACK. I do not believe that this is something which can be changed now.
At least, I am not willing to start such a crusade.

> Back to your patch: I think the text should be fixed to say that both
> ffs() and __ffs()'s kernel implementation doesn't have undefined results

NACK. __ffs(0) is undefined behaviour (c.f. the TZCNT instruction for
x86_64 and the BSF instruction for x86). Even if x86_64 and x86 had the
same behaviour, that would still not be OK as it may fool developers
into believing that __ffs(0) is defined kernel-wide and would result
in non-portable code.

> but since it needs to adhere to the libc variants' API, it treats 0
> differently. They surely can handle 0 as input.
>
> I.e., I'd like to see a comment there explaining the whole difference
> between ffs() and __ffs() so that people are aware.

This would be helpful but the priority would then be to modify asm-generic:
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/__ffs.h#L11

Regardless, I do not think that the comment of __ffs() and ffs() is
related to this patch series.

> Btw, pls do
>
> s/variable___ffs/variable__ffs/g
>
> Two underscores are just fine.

OK for me. The rationale was to name it variable_<function_name>()
thus the three underscores. But I will also be happy with two
underscores.


Yours sincerely,
Vincent Mailhol

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-24 12:10               ` Vincent MAILHOL
@ 2022-08-24 13:24                 ` Borislav Petkov
  2022-08-26 21:32                   ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-08-24 13:24 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…

Not x86 - some old models which do not understand TZCNT, I'm being told.
And I'm being also told, "Intel and AMD disagree on what BSF does when
passed 0". So this is more mess.

> It means that __ffs() is not a x86_64 specific function. Each

No, not that. The comment "Undefined if no bit exists".

On my machine, __ffs(0) - the way it is implemented:

	rep; bsf %1,%0

is well-defined:

"If the input operand is zero, CF is set to 1 and the size (in bits) of
the input operand is written to the destination register. Otherwise, CF
is cleared."

Leading to

__ffs(0): 0x40

i.e., input operand of 64 bits.

So on this particular x86 implementation, TZCNT(0) is well defined.

So I'd like for that "undefined" thing to be expanded upon and
explained. Something along the lines of "the libc/compiler primitives'
*ffs(0) is undefined. Our inline asm helpers adhere to that behavior
even if the result they return for input operand of 0 is very well
defined."

Now, if there are some machines which do not adhere to the current hw
behavior, then they should be ALTERNATIVEd.

Better?

> > Back to your patch: I think the text should be fixed to say that both
> > ffs() and __ffs()'s kernel implementation doesn't have undefined results
> 
> NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for

NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
The *text* - not __ffs(0) itself. The *text* should be fixed to explain
what undefined means. See above too.

IOW, to start explaining this humongous mess I've scratched the surface
of.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-24 13:24                 ` Borislav Petkov
@ 2022-08-26 21:32                   ` Vincent MAILHOL
  2022-09-07  4:06                     ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-08-26 21:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed. 24 Aug. 2022 at 22:24, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> > Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…
>
> Not x86 - some old models which do not understand TZCNT, I'm being told.

ACK.

> And I'm being also told, "Intel and AMD disagree on what BSF does when
> passed 0". So this is more mess.

ACK.

> > It means that __ffs() is not a x86_64 specific function. Each
>
> No, not that. The comment "Undefined if no bit exists".
>
> On my machine, __ffs(0) - the way it is implemented:
>
>         rep; bsf %1,%0
>
> is well-defined:
>
> "If the input operand is zero, CF is set to 1 and the size (in bits) of
> the input operand is written to the destination register. Otherwise, CF
> is cleared."

It is well defined on *your* machine.

On some other machines, it might be undefined:
"If the content of the source operand is 0, the content of the
destination operand is undefined."
https://www.felixcloutier.com/x86/bsf

> Leading to
>
> __ffs(0): 0x40
>
> i.e., input operand of 64 bits.
>
> So on this particular x86 implementation, TZCNT(0) is well defined.

It is here where I do not follow you. OK, on most of the recent
machines the compiler will emit a TZCNT, and that instruction is well
defined for zero. But on some older machines, it will emit BSF, and on
a subset of those machines, BSF(0) might be undefined.

> So I'd like for that "undefined" thing to be expanded upon and
> explained. Something along the lines of "the libc/compiler primitives'
> *ffs(0) is undefined. Our inline asm helpers adhere to that behavior
> even if the result they return for input operand of 0 is very well
> defined."
>
> Now, if there are some machines which do not adhere to the current hw
> behavior, then they should be ALTERNATIVEd.
>
> Better?
>
> > > Back to your patch: I think the text should be fixed to say that both
> > > ffs() and __ffs()'s kernel implementation doesn't have undefined results
> >
> > NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for
>
> NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
> The *text* - not __ffs(0) itself. The *text* should be fixed to explain
> what undefined means. See above too.
>
> IOW, to start explaining this humongous mess I've scratched the surface
> of.

Agree that this is only the surface. But my patch series is about
constant folding, not about the text of *ffs(). Here, I just *move*
the existing text, I did not modify anything.
Can we agree that this is a separate topic? I do not think I am the
right person to fix that mess (and in all honesty, I am not an expert
in this domain and I am afraid I would just make you lose your time if
I had to work on it).


Yours sincerely,
Vincent Mailhol

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (6 preceding siblings ...)
  2022-08-12 11:44 ` [PATCH v5 " Vincent Mailhol
@ 2022-08-31  7:57 ` Vincent Mailhol
  2022-08-31  7:57   ` [PATCH v6 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
                     ` (2 more replies)
  2022-09-05  0:37 ` [PATCH v7 " Vincent Mailhol
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-31  7:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

The compilers provide some builtin expression equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code can not be folded when used
with constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v5 -> v6:
  * Rename variable___ffs() into variable__ffs() (two underscores
    instead of three)


v4 -> v5:

  * (no changes on code, only commit comment was modified)

  * Rewrite the commit log:
    - Use two spaces instead of `| ' to indent code snippets.
    - Do not use `we'.
    - Do not use `this patch' in the commit description. Instead,
      use imperative tone.
  Link: https://lore.kernel.org/all/YvUZVYxbOMcZtR5G@zn.tnic/


v3 -> v4:

  * (no changes on code, only commit comment was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. The number of
    occurrences before the patches is 1081, not 3607 (the percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follow:
    - __varible_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 1/2.

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v6 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-08-31  7:57 ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-08-31  7:57   ` Vincent Mailhol
  2022-08-31  7:57   ` [PATCH v6 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-08-31  8:51   ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Yury Norov
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-31  7:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to fold the expression into a single instruction.

** Example **

Let's consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
  	return ffs(CONST);
  }

  unsigned int bar(void)
  {
  	return __builtin_ffs(CONST);
  }

GCC would produce below assembly code:

  0000000000000000 <foo>:
     0:	ba 00 00 00 01       	mov    $0x1000000,%edx
     5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     a:	0f bc c2             	bsf    %edx,%eax
     d:	83 c0 01             	add    $0x1,%eax
    10:	c3                   	ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20:	b8 19 00 00 00       	mov    $0x19,%eax
    25:	c3                   	ret

And clang would produce:

  0000000000000000 <foo>:
     0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
     c:	83 c0 01             	add    $0x1,%eax
     f:	c3                   	ret

  0000000000000010 <bar>:
    10:	b8 19 00 00 00       	mov    $0x19,%eax
    15:	c3                   	ret

Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able to optimize out the CMOV call).

Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.

As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..6ed979547086 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -310,6 +299,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v6 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-31  7:57 ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-08-31  7:57   ` [PATCH v6 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-08-31  7:57   ` Vincent Mailhol
  2022-08-31  8:51   ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Yury Norov
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-08-31  7:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (GCC emits a useless extra instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which explains the use of `grep tzcnt' instead of `grep
bsf' in the above benchmark). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..bd49aef87ab6 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable__ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-08-31  7:57 ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
  2022-08-31  7:57   ` [PATCH v6 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-08-31  7:57   ` [PATCH v6 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-08-31  8:51   ` Yury Norov
  2022-09-01  3:49     ` Yury Norov
  2022-09-02  1:19     ` Vincent MAILHOL
  2 siblings, 2 replies; 68+ messages in thread
From: Yury Norov @ 2022-08-31  8:51 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> The compilers provide some builtin expression equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be folded when used
> with constant expressions.
> 
> This series relies on __builtin_constant_p to select the optimal solution:
> 
>   * use kernel assembly for non constant expressions
> 
>   * use compiler's __builtin function for constant expressions.
> 
> 
> ** Statistics **
> 
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).

Hi Vincent,

Can you please add a test for this? We've recently added a very similar
test_bitmap_const_eval() in lib/test_bitmap.c.

dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
assertions")

Would be nice to have something like this for ffs() and ffz() in
lib/test_bitops.c.

Please keep me in loop in case of new versions.

Thanks,
Yury


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-08-31  8:51   ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Yury Norov
@ 2022-09-01  3:49     ` Yury Norov
  2022-09-01 10:30       ` Vincent MAILHOL
  2022-09-02  1:19     ` Vincent MAILHOL
  1 sibling, 1 reply; 68+ messages in thread
From: Yury Norov @ 2022-09-01  3:49 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > The compilers provide some builtin expression equivalent to the ffs(),
> > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > assembly which produces better code than the builtin
> > functions. However, such assembly code can not be folded when used
> > with constant expressions.
> > 
> > This series relies on __builtin_constant_p to select the optimal solution:
> > 
> >   * use kernel assembly for non constant expressions
> > 
> >   * use compiler's __builtin function for constant expressions.
> > 
> > 
> > ** Statistics **
> > 
> > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > of __ffs() and ffz() calls (details of the calculation in each patch).
> 
> Hi Vincent,
> 
> Can you please add a test for this? We've recently added a very similar
> test_bitmap_const_eval() in lib/test_bitmap.c.
> 
> dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> assertions")
> 
> Would be nice to have something like this for ffs() and ffz() in
> lib/test_bitops.c.
> 
> Please keep me in loop in case of new versions.

Also, what about fls? Is there any difference with ffs/ffz wrt compile
time optimizations? If not, it would be great if the series could take
care of it too.

Thanks,
Yury


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-01  3:49     ` Yury Norov
@ 2022-09-01 10:30       ` Vincent MAILHOL
  2022-09-01 14:19         ` Yury Norov
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-09-01 10:30 UTC (permalink / raw)
  To: Yury Norov
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Tue. 1 sept. 2022 at 12:49, Yury Norov <yury.norov@gmail.com> wrote:
> On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > The compilers provide some builtin expression equivalent to the ffs(),
> > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be folded when used
> > > with constant expressions.
> > >
> > > This series relies on __builtin_constant_p to select the optimal solution:
> > >
> > >   * use kernel assembly for non constant expressions
> > >
> > >   * use compiler's __builtin function for constant expressions.
> > >
> > >
> > > ** Statistics **
> > >
> > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > of __ffs() and ffz() calls (details of the calculation in each patch).
> >
> > Hi Vincent,
> >
> > Can you please add a test for this? We've recently added a very similar
> > test_bitmap_const_eval() in lib/test_bitmap.c.
> >
> > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > assertions")
> >
> > Would be nice to have something like this for ffs() and ffz() in
> > lib/test_bitops.c.
> >
> > Please keep me in loop in case of new versions.

Hi Yury,

My patch only takes care of the x86 architecture. Assuming some other
architectures are not optimized yet, adding such a test might break
some builds. I am fine with adding the test, however, I will not write
patches for the other architecture because I do not have the
environment to compile and test it.

Does it still make sense to add the test before fixing all the architectures?

> Also, what about fls? Is there any difference with ffs/ffz wrt compile
> time optimizations? If not, would be great if the series will take
> care of it too.

Agree. The fls() and fls64() can use __builtin_clz() and
__builtin_clzll(). However, those two functions are a bit less
trivial. I wanted to have this series approved first before
working on *fls*().


Yours sincerely,
Vincent Mailhol


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-01 10:30       ` Vincent MAILHOL
@ 2022-09-01 14:19         ` Yury Norov
  2022-09-01 17:06           ` Nick Desaulniers
  2022-09-02  0:41           ` Vincent MAILHOL
  0 siblings, 2 replies; 68+ messages in thread
From: Yury Norov @ 2022-09-01 14:19 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> On Tue. 1 sept. 2022 at 12:49, Yury Norov <yury.norov@gmail.com> wrote:
> > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > assembly which produces better code than the builtin
> > > > functions. However, such assembly code can not be folded when used
> > > > with constant expressions.
> > > >
> > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > >
> > > >   * use kernel assembly for non constant expressions
> > > >
> > > >   * use compiler's __builtin function for constant expressions.
> > > >
> > > >
> > > > ** Statistics **
> > > >
> > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > >
> > > Hi Vincent,
> > >
> > > Can you please add a test for this? We've recently added a very similar
> > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > >
> > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > assertions")
> > >
> > > Would be nice to have something like this for ffs() and ffz() in
> > > lib/test_bitops.c.
> > >
> > > Please keep me in loop in case of new versions.
> 
> Hi Yury,
> 
> My patch only takes care of the x86 architecture.

OK, I just realized that you started submitting this at least back in May. 

For me, v6 is good enough and well-described. So, for the series:
Reviewed-by: Yury Norov <yury.norov@gmail.com>

How are you going to merge it? If you don't have a specific tree in mind
already, I can take it in my bitmap tree because ffs and ffz are closely
related to the find_bit() functions.

> Assuming some other
> architectures are not optimized yet, adding such a test might break
> some builds. I am fine with adding the test, however, I will not write
> patches for the other architecture because I do not have the
> environment to compile and test it.
> 
> Does it still make sense to add the test before fixing all the architectures?

An all-arches fix should begin with changing the ffs() design. Namely, there
should be a generic ffs() in include/linux/bitops.h, and an arch-specific
arch__ffs() in arch/xxx/include/asm/bitops.h, like we do for the set_bit()
family. I have a feeling that it's far beyond the scope of your series.

The test is a different story. Good tests are always welcome, even if
they don't cover all the arches.

> > Also, what about fls? Is there any difference with ffs/ffz wrt compile
> > time optimizations? If not, would be great if the series will take
> > care of it too.
> 
> Agree. The fls() and fls64() can use __builtin_ctz() and
> __builtin_ctzll(). However, those two functions are a bit less
> trivial. I wanted to have this first series approved first before
> working on *fls*().

OK, the test and fls() can be a matter of a follow-up series, taking
into account how long these 2 patches have been moving.

Thanks,
Yury


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-01 14:19         ` Yury Norov
@ 2022-09-01 17:06           ` Nick Desaulniers
  2022-09-02  5:34             ` Borislav Petkov
  2022-09-02  0:41           ` Vincent MAILHOL
  1 sibling, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-09-01 17:06 UTC (permalink / raw)
  To: Yury Norov
  Cc: Vincent MAILHOL, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Thu, Sep 1, 2022 at 7:19 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> > On Tue. 1 sept. 2022 at 12:49, Yury Norov <yury.norov@gmail.com> wrote:
> > > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > > assembly which produces better code than the builtin
> > > > > functions. However, such assembly code can not be folded when used
> > > > > with constant expressions.
> > > > >
> > > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > > >
> > > > >   * use kernel assembly for non constant expressions
> > > > >
> > > > >   * use compiler's __builtin function for constant expressions.
> > > > >
> > > > >
> > > > > ** Statistics **
> > > > >
> > > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > > >
> > > > Hi Vincent,
> > > >
> > > > Can you please add a test for this? We've recently added a very similar
> > > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > > >
> > > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > > assertions")
> > > >
> > > > Would be nice to have something like this for ffs() and ffz() in
> > > > lib/test_bitops.c.
> > > >
> > > > Please keep me in loop in case of new versions.
> >
> > Hi Yury,
> >
> > My patch only takes care of the x86 architecture.
>
> OK, I just realized that you started submitting this at least back in May.
>
> For me, v6 is good enough and well-described. So, for the series:
> Reviewed-by: Yury Norov <yury.norov@gmail.com>
>
> How are you going to merge it? If you haven't a specific tree in mind
> already, I can take it in my bitmap tree because  ffs and ffz are closely
> related to find_bit() functions.

Unless Boris feels strong enough to nack the series, I think Yury
accepting the series is the way to go.
-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-01 14:19         ` Yury Norov
  2022-09-01 17:06           ` Nick Desaulniers
@ 2022-09-02  0:41           ` Vincent MAILHOL
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-09-02  0:41 UTC (permalink / raw)
  To: Yury Norov
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Thu. 1 Sep. 2022 at 23:19, Yury Norov <yury.norov@gmail.com> wrote:
> On Thu, Sep 01, 2022 at 07:30:10PM +0900, Vincent MAILHOL wrote:
> > On Tue. 1 sept. 2022 at 12:49, Yury Norov <yury.norov@gmail.com> wrote:
> > > On Wed, Aug 31, 2022 at 01:54:01AM -0700, Yury Norov wrote:
> > > > On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
> > > > > The compilers provide some builtin expression equivalent to the ffs(),
> > > > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > > > assembly which produces better code than the builtin
> > > > > functions. However, such assembly code can not be folded when used
> > > > > with constant expressions.
> > > > >
> > > > > This series relies on __builtin_constant_p to select the optimal solution:
> > > > >
> > > > >   * use kernel assembly for non constant expressions
> > > > >
> > > > >   * use compiler's __builtin function for constant expressions.
> > > > >
> > > > >
> > > > > ** Statistics **
> > > > >
> > > > > Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> > > > > of __ffs() and ffz() calls (details of the calculation in each patch).
> > > >
> > > > Hi Vincent,
> > > >
> > > > Can you please add a test for this? We've recently added a very similar
> > > > test_bitmap_const_eval() in lib/test_bitmap.c.
> > > >
> > > > dc34d5036692c ("lib: test_bitmap: add compile-time optimization/evaluations
> > > > assertions")
> > > >
> > > > Would be nice to have something like this for ffs() and ffz() in
> > > > lib/test_bitops.c.
> > > >
> > > > Please keep me in loop in case of new versions.
> >
> > Hi Yury,
> >
> > My patch only takes care of the x86 architecture.
>
> OK, I just realized that you started submitting this at least back in May.
>
> For me, v6 is good enough and well-described. So, for the series:
> Reviewed-by: Yury Norov <yury.norov@gmail.com>

Thanks for the review!

> How are you going to merge it? If you haven't a specific tree in mind
> already, I can take it in my bitmap tree because  ffs and ffz are closely
> related to find_bit() functions.

I never thought of a specific tree. I just CCed the x86 architecture
maintainers according to get_maintainer.pl and was expecting it to go
through the x86/asm branch of the tip tree. But I am perfectly fine if
it goes through your tree.

So, same as Nick's comment below: unless Borislav still has concerns
about the v6, please take it in your tree.

> > Assuming some other
> > architectures are not optimized yet, adding such a test might break
> > some builds. I am fine with adding the test, however, I will not write
> > patches for the other architecture because I do not have the
> > environment to compile and test it.
> >
> > Does it still make sense to add the test before fixing all the architectures?
>
> All-arches fix should begin with changing the ffs design. Namely, there
> should be a generic ffs() in include/linux/bitops.h,

Currently, the generic fls, ffs and ffz are under:
include/asm-generic/bitops

In particular, here is the generic ffs():
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/ffs.h

Isn't this sufficient?

> and arch-specific
> arch__ffs() in arch/xxx/include/asm/bitops.h; like we do for the set_bit()
> family. I have a feeling that it's far beyond the scope of your series.
>
> The test is a different story. Good tests are always welcome, even if
> they don't cover all the arches.

ACK. I will add the test in a different patch *after* this series gets
accepted. But to be clear, I will not fix other architectures.

> > > Also, what about fls? Is there any difference with ffs/ffz wrt compile
> > > time optimizations? If not, would be great if the series will take
> > > care of it too.
> >
> > Agree. The fls() and fls64() can use __builtin_ctz() and
> > __builtin_ctzll(). However, those two functions are a bit less
> > trivial. I wanted to have this first series approved first before
> > working on *fls*().
>
> OK, the test and fls() can be a matter of a follow-up series, taking
> into account how long are these 2 patches moving.

ACK.

> Thanks,
> Yury


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-08-31  8:51   ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Yury Norov
  2022-09-01  3:49     ` Yury Norov
@ 2022-09-02  1:19     ` Vincent MAILHOL
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-09-02  1:19 UTC (permalink / raw)
  To: Yury Norov
  Cc: Borislav Petkov, Nick Desaulniers, Thomas Gleixner, Ingo Molnar,
	x86, Peter Zijlstra, Dave Hansen, H . Peter Anvin,
	Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells,
	Jan Beulich, Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Tue. 31 Aug 2022 at 17:51, Yury Norov <yury.norov@gmail.com> wrote:
> On Wed, Aug 31, 2022 at 04:57:40PM +0900, Vincent Mailhol wrote:
(...)
> Please keep me in loop in case of new versions.

A side comment, but if you do not want to miss such discussions again
in the future, why not do this?

diff --git a/MAINTAINERS b/MAINTAINERS
index 56af0182a93b..f6caf4252799 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3602,6 +3602,7 @@ M:        Yury Norov <yury.norov@gmail.com>
 R:     Andy Shevchenko <andriy.shevchenko@linux.intel.com>
 R:     Rasmus Villemoes <linux@rasmusvillemoes.dk>
 S:     Maintained
+F:     arch/*/include/asm/bitops.h
 F:     include/linux/bitmap.h
 F:     include/linux/cpumask.h
 F:     include/linux/find.h

N.B. this is just a suggestion, please feel free to discard it.

> Thanks,
> Yury


* Re: [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-01 17:06           ` Nick Desaulniers
@ 2022-09-02  5:34             ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2022-09-02  5:34 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Yury Norov, Vincent MAILHOL, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Thu, Sep 01, 2022 at 10:06:48AM -0700, Nick Desaulniers wrote:
> Unless Boris feels strong enough to nack the series, I think Yury
> accepting the series is the way to go.

arch/x86/ patches go through the tip tree unless there's a compelling
reason to carry them through another. I don't see one here.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (7 preceding siblings ...)
  2022-08-31  7:57 ` [PATCH v6 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Vincent Mailhol
@ 2022-09-05  0:37 ` Vincent Mailhol
  2022-09-05  0:37   ` [PATCH v7 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
                     ` (2 more replies)
  2022-09-07  9:09 ` [PATCH v8 " Vincent Mailhol
                   ` (2 subsequent siblings)
  11 siblings, 3 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-05  0:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be folded when used
with constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v6 -> v7:

  * (no changes on code, only commit tag was modified)

  * Add Reviewed-by: Yury Norov <yury.norov@gmail.com> in both patches


v5 -> v6:
  * Rename variable___ffs() into variable__ffs() (two underscores
    instead of three)


v4 -> v5:

  * (no changes on code, only commit comment was modified)

  * Rewrite the commit log:
    - Use two spaces instead of `| ' to indent code snippets.
    - Do not use `we'.
    - Do not use `this patch' in the commit description. Instead,
      use imperative tone.
  Link: https://lore.kernel.org/all/YvUZVYxbOMcZtR5G@zn.tnic/


v3 -> v4:

  * (no changes on code, only commit comment was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before the next
    function in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. The number of
    occurrences before the patches is 1081, not 3607 (the percentage
    reduction of 26.7% remains correct).

  * Rename the functions as follow:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 1/2.


Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1



* [PATCH v7 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-09-05  0:37 ` [PATCH v7 " Vincent Mailhol
@ 2022-09-05  0:37   ` Vincent Mailhol
  2022-09-05  0:37   ` [PATCH v7 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-09-06 18:26   ` [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-05  0:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to fold the expression into a single instruction.

** Example **

Let's consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
  	return ffs(CONST);
  }

  unsigned int bar(void)
  {
  	return __builtin_ffs(CONST);
  }

GCC would produce below assembly code:

  0000000000000000 <foo>:
     0:	ba 00 00 00 01       	mov    $0x1000000,%edx
     5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     a:	0f bc c2             	bsf    %edx,%eax
     d:	83 c0 01             	add    $0x1,%eax
    10:	c3                   	ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20:	b8 19 00 00 00       	mov    $0x19,%eax
    25:	c3                   	ret

And clang would produce:

  0000000000000000 <foo>:
     0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
     c:	83 c0 01             	add    $0x1,%eax
     f:	c3                   	ret

  0000000000000010 <bar>:
    10:	b8 19 00 00 00       	mov    $0x19,%eax
    15:	c3                   	ret

Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, cf. [1] (notably, clang is
able to optimize out the CMOV call).

Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.

As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 0fe9de58af31..879238e5a6a0 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -292,18 +292,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -333,6 +322,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1



* [PATCH v7 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-09-05  0:37 ` [PATCH v7 " Vincent Mailhol
  2022-09-05  0:37   ` [PATCH v7 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-09-05  0:37   ` Vincent Mailhol
  2022-09-06 18:26   ` [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
  2 siblings, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-05  0:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the bsf instruction is disassembled as tzcnt when
used with the rep prefix (which explains the use of `grep tzcnt'
instead of `grep bsf' in the above statistics), cf. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 879238e5a6a0..08eca2938f27 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -247,13 +247,7 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 					  variable_test_bit(nr, addr);
 }
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable__ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -261,13 +255,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -275,6 +274,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1



* Re: [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-05  0:37 ` [PATCH v7 " Vincent Mailhol
  2022-09-05  0:37   ` [PATCH v7 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-09-05  0:37   ` [PATCH v7 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-09-06 18:26   ` Nick Desaulniers
  2022-09-07  7:04     ` Nick Desaulniers
  2 siblings, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-09-06 18:26 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov

On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> The compilers provide some builtin expression equivalent to the ffs(),
> __ffs() and ffz() functions of the kernel. The kernel uses optimized
> assembly which produces better code than the builtin
> functions. However, such assembly code can not be folded when used
> with constant expressions.

Another tack, which may help sources other than just the Linux
kernel: it seems that compilers should be able to fold this themselves.

Vincent, if you're interested in making such an optimization in LLVM,
we'd welcome the contribution, and I'd be happy to show you where to
make such changes within LLVM; please let me know off thread.

If not, at the least, we should file feature requests in both:
* https://github.com/llvm/llvm-project/issues
* https://gcc.gnu.org/bugzilla/

>
> This series relies on __builtin_constant_p to select the optimal solution:
>
>   * use kernel assembly for non constant expressions
>
>   * use compiler's __builtin function for constant expressions.
>
>
> ** Statistics **
>
> Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
> of __ffs() and ffz() calls (details of the calculation in each patch).
>
>
> ** Changelog **
>
> v6 -> v7:
>
>   * (no changes on code, only commit tag was modified)
>
>   * Add Reviewed-by: Yury Norov <yury.norov@gmail.com> in both patches
>
>
> v5 -> v6:
>   * Rename variable___ffs() into variable__ffs() (two underscores
>     instead of three)
>
>
> v4 -> v5:
>
>   * (no changes on code, only commit comment was modified)
>
>   * Rewrite the commit log:
>     - Use two spaces instead of `| ' to indent code snippets.
>     - Do not use `we'.
>     - Do not use `this patch' in the commit description. Instead,
>       use imperative tone.
>   Link: https://lore.kernel.org/all/YvUZVYxbOMcZtR5G@zn.tnic/
>
>
> v3 -> v4:
>
>   * (no changes on code, only commit comment was modified)
>
>   * Remove note and link to Nick's message in patch 1/2, c.f.:
>   Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
>     patch 2/2.
>
>
> v2 -> v3:
>
>   * Redacted out the instructions after ret and before next function
>     in the assembly output.
>
>   * Added a note and a link to Nick's message on the constant
>     propagation missed-optimization in clang:
>     Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/
>
>   * Fix copy/paste typo in statistics of patch 1/2. Number of
>     occurrences before the patches are 1081 and not 3607 (percentage
>     reduction of 26.7% remains correct)
>
>   * Rename the functions as follows:
>     - __variable_ffs() -> variable___ffs()
>     - __variable_ffz() -> variable_ffz()
>
>   * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
>     patch 1/2.
>
>
> Vincent Mailhol (2):
>   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
>     expressions
>   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
>     expressions
>
>  arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
>  1 file changed, 38 insertions(+), 26 deletions(-)
>
> --
> 2.35.1
>


-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-08-26 21:32                   ` Vincent MAILHOL
@ 2022-09-07  4:06                     ` Borislav Petkov
  2022-09-07  5:35                       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2022-09-07  4:06 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf

On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> Agree that this is only the surface. But, my patch series is about
> constant folding, not about the text of *ffs(). Here, I just *move*
> the existing text, I did not modify anything.
> Can we agree that this is a separate topic?

Sure we can.

But then you can't start your commit message with:

"__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x)."

which will bring unenlightened readers like me into the very same mess.

So at least mention that there's a difference between the kernel
implementation using hw insns which are well defined on some machines
and what the glibc API does. So that at least people are aware that
there's something dangerous to be cautious about.

Ok?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-09-07  4:06                     ` Borislav Petkov
@ 2022-09-07  5:35                       ` Vincent MAILHOL
  2022-09-07  8:50                         ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent MAILHOL @ 2022-09-07  5:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov

On Wed. 7 Sep 2022 at 13:06, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> > Agree that this is only the surface. But, my patch series is about
> > constant folding, not about the text of *ffs(). Here, I just *move*
> > the existing text, I did not modify anything.
> > Can we agree that this is a separate topic?
>
> Sure we can.
>
> But then you can't start your commit message with:
>
> "__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x)."
>
> which will bring unenlightened readers like me into the very same mess.
>
> So at least mention that there's a difference between the kernel
> implementation using hw insns which are well defined on some machines
> and what the glibc API does. So that at least people are aware that
> there's something dangerous to be cautious about.
>
> Ok?

OK.

I rephrased the beginning of the commit message as below:


If x is not 0, __ffs(x) is equivalent to:
  (unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
  (unsigned long)__builtin_ctzl(~x)
Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builtin_ctzl() is safe.



Does this solve the issue? If yes, I will prepare the v8 right away.

Yours sincerely,
Vincent Mailhol

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-06 18:26   ` [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
@ 2022-09-07  7:04     ` Nick Desaulniers
  2022-09-07  7:49       ` Vincent MAILHOL
  0 siblings, 1 reply; 68+ messages in thread
From: Nick Desaulniers @ 2022-09-07  7:04 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov

On Tue, Sep 6, 2022 at 11:26 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > The compilers provide some builtin expression equivalent to the ffs(),
> > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > assembly which produces better code than the builtin
> > functions. However, such assembly code can not be folded when used
> > with constant expressions.
>
> Another tack, which may help sources other than just the Linux
> kernel: it seems that compilers should be able to fold this themselves.
>
> Vincent, if you're interested in making such an optimization in LLVM,
> we'd welcome the contribution, and I'd be happy to show you where to
> make such changes within LLVM; please let me know off thread.

Oh right, it already does.
https://github.com/llvm/llvm-project/blob/ea953b9d9a65c202985a79f1f95da115829baef6/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L2635
I see what's happening. Constant propagation sinks constants into a
specialized version of ffs when there's only 1 callsite in a given
translation unit (or multiple call sites with the same constant).
Then dead argument elimination removes the argument, so libcall
optimization thinks this isn't the ffs(int) you're looking for, and
skips it.  Nice.
https://github.com/llvm/llvm-project/issues/57599
I guess ffs() is usually forward declared in strings.h, so we don't
have such a static inline definition available to constant
prop/specialize in normal C code.
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v7 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-09-07  7:04     ` Nick Desaulniers
@ 2022-09-07  7:49       ` Vincent MAILHOL
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent MAILHOL @ 2022-09-07  7:49 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov

On Wed. 7 Sep. 2022 at 16:04, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Tue, Sep 6, 2022 at 11:26 AM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Sun, Sep 4, 2022 at 5:38 PM Vincent Mailhol
> > <mailhol.vincent@wanadoo.fr> wrote:
> > >
> > > The compilers provide some builtin expression equivalent to the ffs(),
> > > __ffs() and ffz() functions of the kernel. The kernel uses optimized
> > > assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be folded when used
> > > with constant expressions.
> >
> > Another tack, which may help sources other than just the Linux
> > kernel: it seems that compilers should be able to fold this themselves.

Initially, I thought that you were suggesting folding the asm code
(which doesn’t seem trivial at all).

> > Vincent, if you're interested in making such an optimization in LLVM,
> > we'd welcome the contribution, and I'd be happy to show you where to
> > make such changes within LLVM; please let me know off thread.
>
> Oh right, it already does.
> https://github.com/llvm/llvm-project/blob/ea953b9d9a65c202985a79f1f95da115829baef6/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L2635
> I see what's happening. Constant propagation sinks constants into a
> specialized version of ffs when there's only 1 callsite in a given
> translation unit (or multiple call sites with the same constant).
> Then dead argument elimination removes the argument, so libcall
> optimization thinks this isn't the ffs(int) you're looking for, and
> skips it.

Isn’t it a wise decision to skip it? How should the optimization be
able to decide that the redefined ffs() is equivalent to
__builtin_ffs()?

More generally, if I write my own foo() which shadows a
__builtin_foo() function, the two functions might do something totally
different and I would be pissed off if the compiler decided to
constant-fold my foo().

Dummy example:

===================
char *s;

/* ffs: fast forward string
 * @i: how many bytes to move forward
 *
 * Move forward the global s pointer by @i or strlen(s) (whichever is smaller).
 *
 * Return: how many bytes we move forward.
 */
int ffs(int i)
{
        int len = strlen(s);
        int forward = i < len ? i : len;

        s += forward;
        return forward;
}
===================

How would you instruct the compiler to constant-fold the kernel’s
ffs() but not fold above dummy ffs()?

> Nice.
> https://github.com/llvm/llvm-project/issues/57599
> I guess ffs() is usually forward declared in strings.h, so we don't
> have such a static inline definition available to constant
> prop/specialize in normal C code.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-09-07  5:35                       ` Vincent MAILHOL
@ 2022-09-07  8:50                         ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2022-09-07  8:50 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov

On Wed, Sep 07, 2022 at 02:35:41PM +0900, Vincent MAILHOL wrote:
> I rephrased the beginning of the commit message as below:
> 
> 
> If x is not 0, __ffs(x) is equivalent to:
>   (unsigned long)__builtin_ctzl(x)
> And if x is not ~0UL, ffz(x) is equivalent to:
>   (unsigned long)__builtin_ctzl(~x)
> Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
> 
> Concerning the edge cases, __builtin_ctzl(0) is always undefined,
> whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
> the processor. Regardless, for both functions, developers are asked to
> check against 0 or ~0UL so replacing __ffs() or ffz() by
> __builtin_ctzl() is safe.
> 
> 
> 
> Does this solve the issue?

Yes, that sounds better.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v8 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (8 preceding siblings ...)
  2022-09-05  0:37 ` [PATCH v7 " Vincent Mailhol
@ 2022-09-07  9:09 ` Vincent Mailhol
  2022-09-07  9:09   ` [PATCH v8 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-09-07  9:09   ` [PATCH v8 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-09-20 20:00 ` [tip: x86/asm] x86/asm/bitops: Use __builtin_ctzl() " tip-bot2 for Vincent Mailhol
  2022-09-20 20:00 ` [tip: x86/asm] x86/asm/bitops: Use __builtin_ffs() " tip-bot2 for Vincent Mailhol
  11 siblings, 2 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-07  9:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin
functions. However, such assembly code cannot be folded when used
with constant expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.


** Statistics **

Patch 1/2 optimizes 26.7% of ffs() calls and patch 2/2 optimizes 27.9%
of __ffs() and ffz() calls (details of the calculation in each patch).


** Changelog **

v7 -> v8:

  * (no changes on code, only commit comment was modified)

  * Rewrite the introduction of patch 2/2 to add nuance on the
    defined/undefined behaviors of __builtin_ctzl(0), __ffs(0) and
    ffz(~0UL).


v6 -> v7:

  * (no changes on code, only commit tag was modified)

  * Add Reviewed-by: Yury Norov <yury.norov@gmail.com> in both patches


v5 -> v6:
  * Rename variable___ffs() into variable__ffs() (two underscores
    instead of three)


v4 -> v5:

  * (no changes on code, only commit comment was modified)

  * Rewrite the commit log:
    - Use two spaces instead of `| ' to indent code snippets.
    - Do not use `we'.
    - Do not use `this patch' in the commit description. Instead,
      use imperative tone.
  Link: https://lore.kernel.org/all/YvUZVYxbOMcZtR5G@zn.tnic/


v3 -> v4:

  * (no changes on code, only commit comment was modified)

  * Remove note and link to Nick's message in patch 1/2, c.f.:
  Link: https://lore.kernel.org/all/CAKwvOdnnDaiJcV1gr9vV+ya-jWxx7+2KJNTDThyFctVDOgt9zQ@mail.gmail.com/

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 2/2.


v2 -> v3:

  * Redacted out the instructions after ret and before next function
    in the assembly output.

  * Added a note and a link to Nick's message on the constant
    propagation missed-optimization in clang:
    Link: https://lore.kernel.org/all/CAKwvOdnH_gYv4qRN9pKY7jNTQK95xNeH1w1KZJJmvCkh8xJLBg@mail.gmail.com/

  * Fix copy/paste typo in statistics of patch 1/2. Number of
    occurrences before the patches are 1081 and not 3607 (percentage
    reduction of 26.7% remains correct)

  * Rename the functions as follows:
    - __variable_ffs() -> variable___ffs()
    - __variable_ffz() -> variable_ffz()

  * Add Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> tag in
    patch 1/2.


Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 64 +++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 26 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v8 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-09-07  9:09 ` [PATCH v8 " Vincent Mailhol
@ 2022-09-07  9:09   ` Vincent Mailhol
  2022-09-07  9:09   ` [PATCH v8 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-07  9:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() functions of both GCC and clang are able
to fold the expression into a single instruction.

** Example **

Let's consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
  	return ffs(CONST);
  }

  unsigned int bar(void)
  {
  	return __builtin_ffs(CONST);
  }

GCC would produce below assembly code:

  0000000000000000 <foo>:
     0:	ba 00 00 00 01       	mov    $0x1000000,%edx
     5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     a:	0f bc c2             	bsf    %edx,%eax
     d:	83 c0 01             	add    $0x1,%eax
    10:	c3                   	ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20:	b8 19 00 00 00       	mov    $0x19,%eax
    25:	c3                   	ret

And clang would produce:

  0000000000000000 <foo>:
     0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
     c:	83 c0 01             	add    $0x1,%eax
     f:	c3                   	ret

  0000000000000010 <bar>:
    10:	b8 19 00 00 00       	mov    $0x19,%eax
    15:	c3                   	ret

Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non-constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it doesn't
emit the CMOV assembly instruction, c.f. [1] (notably, clang is
able to optimize out the CMOV call).

Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.

As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 0fe9de58af31..879238e5a6a0 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -292,18 +292,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -333,6 +322,19 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v8 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-09-07  9:09 ` [PATCH v8 " Vincent Mailhol
  2022-09-07  9:09   ` [PATCH v8 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-09-07  9:09   ` Vincent Mailhol
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Mailhol @ 2022-09-07  9:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, Ingo Molnar, x86,
	Peter Zijlstra, Dave Hansen, H . Peter Anvin, Nathan Chancellor,
	Tom Rix, linux-kernel, llvm, David Howells, Jan Beulich,
	Christophe Jaillet, Joe Perches, Josh Poimboeuf, Yury Norov,
	Vincent Mailhol

If x is not 0, __ffs(x) is equivalent to:
  (unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
  (unsigned long)__builtin_ctzl(~x)
Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builtin_ctzl() is safe.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, __builtin_ctzl() folds into a single instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the rep prefix (which explains the use of `grep tzcnt' instead of `grep
bsf' in above benchmark). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 879238e5a6a0..95591310c080 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -247,13 +247,7 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 					  variable_test_bit(nr, addr);
 }
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable__ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -261,13 +255,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -275,6 +274,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: x86/asm] x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (9 preceding siblings ...)
  2022-09-07  9:09 ` [PATCH v8 " Vincent Mailhol
@ 2022-09-20 20:00 ` tip-bot2 for Vincent Mailhol
  2022-09-20 20:00 ` [tip: x86/asm] x86/asm/bitops: Use __builtin_ffs() " tip-bot2 for Vincent Mailhol
  11 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Vincent Mailhol @ 2022-09-20 20:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Mailhol, Borislav Petkov, Nick Desaulniers, Yury Norov,
	x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     fdb6649ab7c142e497539a471e573c2593b9c923
Gitweb:        https://git.kernel.org/tip/fdb6649ab7c142e497539a471e573c2593b9c923
Author:        Vincent Mailhol <mailhol.vincent@wanadoo.fr>
AuthorDate:    Wed, 07 Sep 2022 18:09:35 +09:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Tue, 20 Sep 2022 15:35:37 +02:00

x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions

If x is not 0, __ffs(x) is equivalent to:
  (unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
  (unsigned long)__builtin_ctzl(~x)
Because __builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builtin_ctzl() is safe.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, __builtin_ctzl() folds into a single instruction.

However, for non-constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (which emits a useless instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the BSF instruction produces TZCNT when used with the
REP prefix (which explains the use of `grep tzcnt' instead of `grep bsf'
in above benchmark). c.f. [1]

[1] e26a44a2d618 ("x86: Use REP BSF unconditionally")

  [ bp: Massage commit message. ]

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220511160319.1045812-1-mailhol.vincent@wanadoo.fr
---
 arch/x86/include/asm/bitops.h | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 879238e..2edf684 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -247,17 +247,30 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 					  variable_test_bit(nr, addr);
 }
 
+static __always_inline unsigned long variable__ffs(unsigned long word)
+{
+	asm("rep; bsf %1,%0"
+		: "=r" (word)
+		: "rm" (word));
+	return word;
+}
+
 /**
  * __ffs - find first set bit in word
  * @word: The word to search
  *
  * Undefined if no bit exists, so code should check against 0 first.
  */
-static __always_inline unsigned long __ffs(unsigned long word)
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable__ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
-		: "rm" (word));
+		: "r" (~word));
 	return word;
 }
 
@@ -267,13 +280,10 @@ static __always_inline unsigned long __ffs(unsigned long word)
  *
  * Undefined if no zero exists, so code should check against ~0UL first.
  */
-static __always_inline unsigned long ffz(unsigned long word)
-{
-	asm("rep; bsf %1,%0"
-		: "=r" (word)
-		: "r" (~word));
-	return word;
-}
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
 
 /*
  * __fls: find last set bit in word


* [tip: x86/asm] x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions
  2022-05-11 16:03 [PATCH v2 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
                   ` (10 preceding siblings ...)
  2022-09-20 20:00 ` [tip: x86/asm] x86/asm/bitops: Use __builtin_ctzl() " tip-bot2 for Vincent Mailhol
@ 2022-09-20 20:00 ` tip-bot2 for Vincent Mailhol
  11 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Vincent Mailhol @ 2022-09-20 20:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Mailhol, Borislav Petkov, Nick Desaulniers, Yury Norov,
	x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     146034fed6ee75ec09cf8f996165e2296ceae0bb
Gitweb:        https://git.kernel.org/tip/146034fed6ee75ec09cf8f996165e2296ceae0bb
Author:        Vincent Mailhol <mailhol.vincent@wanadoo.fr>
AuthorDate:    Wed, 07 Sep 2022 18:09:34 +09:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Tue, 20 Sep 2022 15:31:17 +02:00

x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions

For x86_64, the current ffs() implementation does not produce optimized
code when called with a constant expression. On the contrary, the
__builtin_ffs() functions of both GCC and clang are able to fold the
expression into a single instruction.

** Example **

Consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
  	return ffs(CONST);
  }

  unsigned int bar(void)
  {
  	return __builtin_ffs(CONST);
  }

GCC would produce the assembly code below:

  0000000000000000 <foo>:
     0:	ba 00 00 00 01       	mov    $0x1000000,%edx
     5:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     a:	0f bc c2             	bsf    %edx,%eax
     d:	83 c0 01             	add    $0x1,%eax
    10:	c3                   	ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20:	b8 19 00 00 00       	mov    $0x19,%eax
    25:	c3                   	ret

And clang would produce:

  0000000000000000 <foo>:
     0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
     5:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # c <foo+0xc>
     c:	83 c0 01             	add    $0x1,%eax
     f:	c3                   	ret

  0000000000000010 <bar>:
    10:	b8 19 00 00 00       	mov    $0x19,%eax
    15:	c3                   	ret

Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non-constant expressions, the kernel's ffs() asm version
remains better for x86_64 because, contrary to GCC, it doesn't emit the
CMOV assembly instruction, cf. [1] (notably, clang is able to optimize
out the CMOV).

Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.

As a side benefit, replacing the ffs() function declaration with a
macro also removes the -Wshadow warning below:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")

  [ bp: Massage commit message. ]

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220511160319.1045812-1-mailhol.vincent@wanadoo.fr
---
 arch/x86/include/asm/bitops.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 0fe9de5..879238e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -292,18 +292,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int variable_ffs(int x)
 {
 	int r;
 
@@ -334,6 +323,19 @@ static __always_inline int ffs(int x)
 }
 
 /**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
+
+/**
  * fls - find last set bit in word
  * @x: the word to search
  *

