* [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions
@ 2022-11-06  9:51 Vincent Mailhol
  2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-06  9:51 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Vincent Mailhol

Compilers provide builtin expressions equivalent to the kernel's
fls(), __fls() and fls64() functions.

The kernel's x86 implementation relies on assembly code, which cannot
be folded when used with constant expressions.

This series replaces the kernel assembly with a builtin equivalent
when appropriate. It is a follow-up to this previous series:

  https://lore.kernel.org/all/20220907090935.919-1-mailhol.vincent@wanadoo.fr/

in which I promised to also modify the fls() functions.

Vincent Mailhol (2):
  x86/asm/bitops: Replace __fls() by its generic builtin implementation
  x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions

 arch/x86/include/asm/bitops.h | 71 ++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 30 deletions(-)

-- 
2.37.4


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-06  9:51 [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions Vincent Mailhol
@ 2022-11-06  9:51 ` Vincent Mailhol
  2022-11-07  9:38   ` Peter Zijlstra
  2022-11-10 19:06   ` Nick Desaulniers
  2022-11-06  9:51 ` [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
  2022-11-25  2:33 ` [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for " Vincent Mailhol
  2 siblings, 2 replies; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-06  9:51 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Vincent Mailhol, Borislav Petkov

Below snippet:

  #include <linux/bitops.h>

  unsigned int foo(unsigned long word)
  {
  	return __fls(word);
  }

produces this on GCC 12.1.0:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	31 ff                	xor    %edi,%edi
    19:	e9 00 00 00 00       	jmp    1e <foo+0x1e>

and that on clang 14.0.6:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	50                   	push   %rax
     b:	48 89 fb             	mov    %rdi,%rbx
     e:	e8 00 00 00 00       	call   13 <foo+0x13>
    13:	48 89 1c 24          	mov    %rbx,(%rsp)
    17:	48 0f bd 04 24       	bsr    (%rsp),%rax
    1c:	48 83 c4 08          	add    $0x8,%rsp
    20:	5b                   	pop    %rbx
    21:	c3                   	ret

The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
produces the exact same code on GCC and below code on clang:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	c3                   	ret

The builtin implementation is better for two reasons:

  1/ it saves two instructions on clang (a push and a stack pointer
     decrement) caused by a useless attempt to save rax.

  2/ when used on constant expressions, the compiler is only able to
     fold the builtin version (cf. [2]).

For those two reasons, replace the assembly implementation with its
builtin counterpart.

[1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <bp@suse.de>
CC: Nick Desaulniers <ndesaulniers@google.com>
CC: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 2edf68475fec..a31453d7686d 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
 	 (unsigned long)__builtin_ctzl(~word) :	\
 	 variable_ffz(word))
 
-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
-	asm("bsr %1,%0"
-	    : "=r" (word)
-	    : "rm" (word));
-	return word;
-}
+#include <asm-generic/bitops/builtin-__fls.h>
 
 #undef ADDR
 
-- 
2.37.4



* [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-06  9:51 [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions Vincent Mailhol
  2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
@ 2022-11-06  9:51 ` Vincent Mailhol
  2022-11-10 19:01   ` Nick Desaulniers
  2022-11-25  2:33 ` [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for " Vincent Mailhol
  2 siblings, 1 reply; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-06  9:51 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Vincent Mailhol, Borislav Petkov

GCC and clang offer the __builtin_clz(x) and __builtin_clzll(x)
functions, which return the number of leading zero bits in x,
cf. [1]. By a simple subtraction, we can derive the following
equivalences:

* For fls:
  Aside from the x = 0 special case, fls(x) is equivalent to
  BITS_PER_TYPE(x) - __builtin_clz(x).

* For fls64:
  Aside from the x = 0 special case, fls64(x) is equivalent to
  BITS_PER_TYPE(x) - __builtin_clzll(x). __builtin_clzll() takes an
  unsigned long long argument. We choose this flavor because
  BITS_PER_LONG_LONG is defined as 64 bits for all architectures,
  making it the most portable one. A BUILD_BUG_ON() safety net is
  added.

When used on constant expressions, the compiler is only able to fold
the builtin version, cf. [2]. However, for non-constant expressions,
the kernel's inline assembly results in better code on both GCC and
clang.

Use __builtin_constant_p() to select between the kernel's
fls()/fls64() and __builtin_clz()/__builtin_clzll(), depending on
whether the argument is a constant expression or not.

While renaming fls64() to variable_fls64(), change the argument type
from __u64 to u64 because we are not in a uapi header.

[1] Built-in Functions Provided by GCC:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <bp@suse.de>
CC: Nick Desaulniers <ndesaulniers@google.com>
CC: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 57 ++++++++++++++++++++++++-----------
 1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a31453d7686d..58fb2fc49760 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -333,18 +333,15 @@ static __always_inline int variable_ffs(int x)
  */
 #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
 
-/**
- * fls - find last set bit in word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffs, but returns the position of the most significant set bit.
- *
- * fls(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 32.
- */
-static __always_inline int fls(unsigned int x)
+static __always_inline int constant_fls(unsigned int x)
+{
+	if (!x)
+		return 0;
+
+	return BITS_PER_TYPE(x) - __builtin_clz(x);
+}
+
+static __always_inline int variable_fls(unsigned int x)
 {
 	int r;
 
@@ -375,18 +372,30 @@ static __always_inline int fls(unsigned int x)
 }
 
 /**
- * fls64 - find last set bit in a 64-bit word
+ * fls - find last set bit in word
  * @x: the word to search
  *
  * This is defined in a similar way as the libc and compiler builtin
- * ffsll, but returns the position of the most significant set bit.
+ * ffs, but returns the position of the most significant set bit.
  *
- * fls64(value) returns 0 if value is 0 or the position of the last
+ * fls(value) returns 0 if value is 0 or the position of the last
  * set bit if value is nonzero. The last (most significant) bit is
- * at position 64.
+ * at position 32.
  */
+#define fls(x) (__builtin_constant_p(x) ? constant_fls(x) : variable_fls(x))
+
 #ifdef CONFIG_X86_64
-static __always_inline int fls64(__u64 x)
+static __always_inline int constant_fls64(u64 x)
+{
+	BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
+
+	if (!x)
+		return 0;
+
+	return BITS_PER_TYPE(x) - __builtin_clzll(x);
+}
+
+static __always_inline int variable_fls64(u64 x)
 {
 	int bitpos = -1;
 	/*
@@ -399,6 +408,20 @@ static __always_inline int fls64(__u64 x)
 	    : "rm" (x));
 	return bitpos + 1;
 }
+
+/**
+ * fls64 - find last set bit in a 64-bit word
+ * @x: the word to search
+ *
+ * This is defined in a similar way as the libc and compiler builtin
+ * ffsll, but returns the position of the most significant set bit.
+ *
+ * fls64(value) returns 0 if value is 0 or the position of the last
+ * set bit if value is nonzero. The last (most significant) bit is
+ * at position 64.
+ */
+#define fls64(x) \
+	(__builtin_constant_p(x) ? constant_fls64(x) : variable_fls64(x))
 #else
 #include <asm-generic/bitops/fls64.h>
 #endif
-- 
2.37.4



* Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
@ 2022-11-07  9:38   ` Peter Zijlstra
  2022-11-07 12:19     ` Vincent MAILHOL
  2022-11-10 19:20     ` Nick Desaulniers
  2022-11-10 19:06   ` Nick Desaulniers
  1 sibling, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2022-11-07  9:38 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: x86, Ingo Molnar, Borislav Petkov, Nick Desaulniers,
	Thomas Gleixner, linux-kernel, Yury Norov, llvm, Borislav Petkov

On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> The builtin implementation is better for two reasons:
> 
>   1/ it saves two instructions on clang (a push and a stack pointer
>      decrement) because of a useless tentative to save rax.

I'm thinking this is the same old clang-sucks-at-"rm"-constraints
issue and *really* should not be a reason to change things. Clang
should get fixed already.

>   2/ when used on constant expressions, the compiler is only able to
>      fold the builtin version (c.f. [2]).
> 
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
> 
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> 
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

I would much prefer consistency with 146034fed6ee.


* Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-07  9:38   ` Peter Zijlstra
@ 2022-11-07 12:19     ` Vincent MAILHOL
  2022-11-10 19:20     ` Nick Desaulniers
  1 sibling, 0 replies; 14+ messages in thread
From: Vincent MAILHOL @ 2022-11-07 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, Ingo Molnar, Borislav Petkov, Nick Desaulniers,
	Thomas Gleixner, linux-kernel, Yury Norov, llvm, Borislav Petkov

On Mon. 7 Nov. 2022 at 18:38, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> >   1/ it saves two instructions on clang (a push and a stack pointer
> >      decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.
>
> >   2/ when used on constant expressions, the compiler is only able to
> >      fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistency with 146034fed6ee.

There is one big difference between 146034fed6ee and this patch. For
the ffs() functions, the x86 asm produces *better* code, so there is a
reason to keep the x86 asm.

The clang missed optimization is not the core reason for me to send
this patch. The purpose of the x86 asm code is to be more performant
than the generic implementation, isn't it? Let's imagine for a moment
that the x86 asm and the builtin produced exactly the same output:
then what would be the reason for keeping the x86 asm version?

My point is not that the x86 asm is worse, but that it isn't better.
The clang missed optimization is one additional reason for this patch,
not the main one.

Yours sincerely,
Vincent Mailhol


* Re: [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-06  9:51 ` [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
@ 2022-11-10 19:01   ` Nick Desaulniers
  2022-11-11  1:57     ` Vincent MAILHOL
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Desaulniers @ 2022-11-10 19:01 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: x86, Ingo Molnar, Borislav Petkov, Thomas Gleixner, linux-kernel,
	Yury Norov, llvm, Borislav Petkov

On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
>  #ifdef CONFIG_X86_64
> -static __always_inline int fls64(__u64 x)
> +static __always_inline int constant_fls64(u64 x)
> +{
> +       BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));

Thanks for the patches! They LGTM; but why do we need this BUILD_BUG_ON here?

> +
> +       if (!x)
> +               return 0;
> +
> +       return BITS_PER_TYPE(x) - __builtin_clzll(x);
> +}

-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
  2022-11-07  9:38   ` Peter Zijlstra
@ 2022-11-10 19:06   ` Nick Desaulniers
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Desaulniers @ 2022-11-10 19:06 UTC (permalink / raw)
  To: Vincent Mailhol
  Cc: x86, Ingo Molnar, Borislav Petkov, Thomas Gleixner, linux-kernel,
	Yury Norov, llvm, Borislav Petkov

On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> Below snippet:
>
>   #include <linux/bitops.h>
>
>   unsigned int foo(unsigned long word)
>   {
>         return __fls(word);
>   }
>
> produces this on GCC 12.1.0:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 48 89 fb                mov    %rdi,%rbx
>      d: e8 00 00 00 00          call   12 <foo+0x12>
>     12: 48 0f bd c3             bsr    %rbx,%rax
>     16: 5b                      pop    %rbx
>     17: 31 ff                   xor    %edi,%edi
>     19: e9 00 00 00 00          jmp    1e <foo+0x1e>
>
> and that on clang 14.0.6:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 50                      push   %rax
>      b: 48 89 fb                mov    %rdi,%rbx
>      e: e8 00 00 00 00          call   13 <foo+0x13>
>     13: 48 89 1c 24             mov    %rbx,(%rsp)
>     17: 48 0f bd 04 24          bsr    (%rsp),%rax
>     1c: 48 83 c4 08             add    $0x8,%rsp
>     20: 5b                      pop    %rbx
>     21: c3                      ret
>
> The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
> produces the exact same code on GCC and below code on clang:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 48 89 fb                mov    %rdi,%rbx
>      d: e8 00 00 00 00          call   12 <foo+0x12>
>     12: 48 0f bd c3             bsr    %rbx,%rax
>     16: 5b                      pop    %rbx
>     17: c3                      ret
>
> The builtin implementation is better for two reasons:
>
>   1/ it saves two instructions on clang (a push and a stack pointer
>      decrement) because of a useless tentative to save rax.
>
>   2/ when used on constant expressions, the compiler is only able to
>      fold the builtin version (c.f. [2]).
>
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
>
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
>
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> CC: Borislav Petkov <bp@suse.de>
> CC: Nick Desaulniers <ndesaulniers@google.com>
> CC: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>

LGTM; thanks for the patch!
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>

> ---
>  arch/x86/include/asm/bitops.h | 14 +-------------
>  1 file changed, 1 insertion(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 2edf68475fec..a31453d7686d 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
>          (unsigned long)__builtin_ctzl(~word) : \
>          variable_ffz(word))
>
> -/*
> - * __fls: find last set bit in word
> - * @word: The word to search
> - *
> - * Undefined if no set bit exists, so code should check against 0 first.
> - */
> -static __always_inline unsigned long __fls(unsigned long word)
> -{
> -       asm("bsr %1,%0"
> -           : "=r" (word)
> -           : "rm" (word));
> -       return word;
> -}
> +#include <asm-generic/bitops/builtin-__fls.h>
>
>  #undef ADDR
>
> --
> 2.37.4
>


-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-07  9:38   ` Peter Zijlstra
  2022-11-07 12:19     ` Vincent MAILHOL
@ 2022-11-10 19:20     ` Nick Desaulniers
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Desaulniers @ 2022-11-10 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Mailhol, x86, Ingo Molnar, Borislav Petkov,
	Thomas Gleixner, linux-kernel, Yury Norov, llvm, Borislav Petkov

On Mon, Nov 7, 2022 at 1:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> >   1/ it saves two instructions on clang (a push and a stack pointer
> >      decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.

Well, messing up constant folding for all compilers absolutely should
be a reason!

I did get a chance to speak with some colleagues more about this at
the LLVM developer meeting during the past two days. We have some
ideas on approaches that might work. There are some higher-priority
features we're working on first, but I suspect we'll be able to visit
that issue soon. It's a pretty tricky dance between instruction
selection and register allocation.

>
> >   2/ when used on constant expressions, the compiler is only able to
> >      fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistency with 146034fed6ee.

The bottom of this file arch/x86/include/asm/bitops.h is full of
#include <asm-generic/bitops/*.h>

-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-10 19:01   ` Nick Desaulniers
@ 2022-11-11  1:57     ` Vincent MAILHOL
  2022-11-11  3:36       ` Yury Norov
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent MAILHOL @ 2022-11-11  1:57 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: x86, Ingo Molnar, Borislav Petkov, Thomas Gleixner, linux-kernel,
	Yury Norov, llvm, Borislav Petkov

On Fri. 11 Nov. 2022 at 04:01, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> >  #ifdef CONFIG_X86_64
> > -static __always_inline int fls64(__u64 x)
> > +static __always_inline int constant_fls64(u64 x)
> > +{
> > +       BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
>
> Thanks for the patches! They LGTM; but why do we need this BUILD_BUG_ON here?

There is no absolute need for sure.

Call this a paranoid check and you will be correct. My reasoning for
still using it is that:

  1/ It is a compile-time check, so there is no runtime penalty.
  2/ Strictly speaking, the C standard doesn't guarantee that 'u64' and
     'unsigned long long int' are the same type (and you can argue that
     in clang and GCC, long long is always 64 bits on all platforms,
     and once more you will be correct).
  3/ It serves as documentation to say: "hey, I am using the clz long
     long version on a u64 and I know what I am doing."

If you want me to remove it, OK for me. Let me know.

> > +
> > +       if (!x)
> > +               return 0;
> > +
> > +       return BITS_PER_TYPE(x) - __builtin_clzll(x);
> > +}


* Re: [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-11  1:57     ` Vincent MAILHOL
@ 2022-11-11  3:36       ` Yury Norov
  2022-11-11  7:02         ` Vincent MAILHOL
  0 siblings, 1 reply; 14+ messages in thread
From: Yury Norov @ 2022-11-11  3:36 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Nick Desaulniers, x86, Ingo Molnar, Borislav Petkov,
	Thomas Gleixner, linux-kernel, llvm, Borislav Petkov

On Fri, Nov 11, 2022 at 10:57:17AM +0900, Vincent MAILHOL wrote:
> On Fri. 11 Nov. 2022 at 04:01, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
> > <mailhol.vincent@wanadoo.fr> wrote:
> > >
> > >  #ifdef CONFIG_X86_64
> > > -static __always_inline int fls64(__u64 x)
> > > +static __always_inline int constant_fls64(u64 x)
> > > +{
> > > +       BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
> >
> > Thanks for the patches! They LGTM; but why do we need this BUILD_BUG_ON here?
> 
> There is no absolute need for sure.
> 
> Call this a paranoiac check and you will be correct. My reasoning for still
> using it is that:
> 
>   1/ It is a compile time check, so no runtime penalty.
>   2/ Strictly speaking, the C standard doesn't guarantee 'u64' and
>      'unsigned long long int' to be the same (and you can argue that in clang
>      and gcc long long is always 64 bits on all platforms and one more time
>      you will be correct).
>   3/ It serves as a documentation to say: "hey I am using the clz long long
>      version on a u64 and I know what I am doing."
> 
> If you want me to remove it, OK for me. Let me know.

In fact, the compiler's type check would be stricter than your BUG().
For example, your check allows pointers, but the compiler will complain.


* Re: [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-11  3:36       ` Yury Norov
@ 2022-11-11  7:02         ` Vincent MAILHOL
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent MAILHOL @ 2022-11-11  7:02 UTC (permalink / raw)
  To: Yury Norov
  Cc: Nick Desaulniers, x86, Ingo Molnar, Borislav Petkov,
	Thomas Gleixner, linux-kernel, llvm, Borislav Petkov

On Fri. 11 Nov. 2022 at 12:36, Yury Norov <yury.norov@gmail.com> wrote:
> On Fri, Nov 11, 2022 at 10:57:17AM +0900, Vincent MAILHOL wrote:
> > On Fri. 11 Nov. 2022 at 04:01, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > > On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
> > > <mailhol.vincent@wanadoo.fr> wrote:
> > > >
> > > >  #ifdef CONFIG_X86_64
> > > > -static __always_inline int fls64(__u64 x)
> > > > +static __always_inline int constant_fls64(u64 x)
> > > > +{
> > > > +       BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
> > >
> > > Thanks for the patches! They LGTM; but why do we need this BUILD_BUG_ON here?
> >
> > There is no absolute need for sure.
> >
> > Call this a paranoiac check and you will be correct. My reasoning for still
> > using it is that:
> >
> >   1/ It is a compile time check, so no runtime penalty.
> >   2/ Strictly speaking, the C standard doesn't guarantee 'u64' and
> >      'unsigned long long int' to be the same (and you can argue that in clang
> >      and gcc long long is always 64 bits on all platforms and one more time
> >      you will be correct).
> >   3/ It serves as a documentation to say: "hey I am using the clz long long
> >      version on a u64 and I know what I am doing."
> >
> > If you want me to remove it, OK for me. Let me know.
>
> In fact, compiler's typecheck would be more strict than your BUG().
> For example, your check allows pointers, but compiler will complain.

Here, x is a scalar, so in this specific case it is equivalent. But
the compiler type check is more explicit and more natural, because it
would also work if ported to other contexts, as you pointed out. So it
is a good idea.

The check would become:
        static_assert(__builtin_types_compatible_p(typeof(x),
                                                   unsigned long long int));

Alternatively, we could use the assert_arg_type() macro from
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/irq_stack.h#L126
which does exactly the same thing.

But in such a case, it would be better to first move assert_arg_type()
to a more appropriate header (maybe linux/build_bug.h?).


Yours sincerely,
Vincent Mailhol


* [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for constant expressions
  2022-11-06  9:51 [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions Vincent Mailhol
  2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
  2022-11-06  9:51 ` [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
@ 2022-11-25  2:33 ` Vincent Mailhol
  2022-11-25  2:33   ` [PATCH v2 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
  2022-11-25  2:33   ` [PATCH v2 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
  2 siblings, 2 replies; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-25  2:33 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Peter Zijlstra, Vincent Mailhol

Compilers provide builtin expressions equivalent to the kernel's
fls(), __fls() and fls64() functions.

The kernel's x86 implementation relies on assembly code, which cannot
be folded when used with constant expressions.

This series replaces the kernel assembly with a builtin equivalent
when appropriate. It is a follow-up to this previous series:

  https://lore.kernel.org/all/20220907090935.919-1-mailhol.vincent@wanadoo.fr/

in which I promised to also modify the fls() functions.

* Changelog *

v1 -> v2:

  * [patch 1/2] add Reviewed-by: Nick Desaulniers tag.

  * [patch 2/2] replace:
       BUILD_BUG_ON(sizeof(unsigned long long) != sizeof(x));
    by:
       BUILD_BUG_ON(!__same_type(x, unsigned long long));
    and slightly modify the commit description.

Vincent Mailhol (2):
  x86/asm/bitops: Replace __fls() by its generic builtin implementation
  x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions

 arch/x86/include/asm/bitops.h | 70 ++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 30 deletions(-)

-- 
2.37.4



* [PATCH v2 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation
  2022-11-25  2:33 ` [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for " Vincent Mailhol
@ 2022-11-25  2:33   ` Vincent Mailhol
  2022-11-25  2:33   ` [PATCH v2 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
  1 sibling, 0 replies; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-25  2:33 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Peter Zijlstra, Vincent Mailhol, Borislav Petkov

Below snippet:

  #include <linux/bitops.h>

  unsigned int foo(unsigned long word)
  {
  	return __fls(word);
  }

produces this on GCC 12.1.0:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	31 ff                	xor    %edi,%edi
    19:	e9 00 00 00 00       	jmp    1e <foo+0x1e>

and that on clang 14.0.6:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	50                   	push   %rax
     b:	48 89 fb             	mov    %rdi,%rbx
     e:	e8 00 00 00 00       	call   13 <foo+0x13>
    13:	48 89 1c 24          	mov    %rbx,(%rsp)
    17:	48 0f bd 04 24       	bsr    (%rsp),%rax
    1c:	48 83 c4 08          	add    $0x8,%rsp
    20:	5b                   	pop    %rbx
    21:	c3                   	ret

The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
produces the exact same code on GCC and below code on clang:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	c3                   	ret

The builtin implementation is better for two reasons:

  1/ it saves two instructions on clang (a push and a stack pointer
     decrement).

  2/ when used in constant expressions, only the builtin version can
     be folded by the compiler (cf. [2]).

For those two reasons, replace the assembly implementation by its
builtin counterpart.

[1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <bp@suse.de>
CC: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
---
 arch/x86/include/asm/bitops.h | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 2edf68475fec..a31453d7686d 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
 	 (unsigned long)__builtin_ctzl(~word) :	\
 	 variable_ffz(word))
 
-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
-	asm("bsr %1,%0"
-	    : "=r" (word)
-	    : "rm" (word));
-	return word;
-}
+#include <asm-generic/bitops/builtin-__fls.h>
 
 #undef ADDR
 
-- 
2.37.4



* [PATCH v2 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions
  2022-11-25  2:33 ` [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for " Vincent Mailhol
  2022-11-25  2:33   ` [PATCH v2 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
@ 2022-11-25  2:33   ` Vincent Mailhol
  1 sibling, 0 replies; 14+ messages in thread
From: Vincent Mailhol @ 2022-11-25  2:33 UTC (permalink / raw)
  To: x86, Ingo Molnar, Borislav Petkov
  Cc: Nick Desaulniers, Thomas Gleixner, linux-kernel, Yury Norov,
	llvm, Peter Zijlstra, Vincent Mailhol, Borislav Petkov

GCC and clang offer the __builtin_clz(x) and __builtin_clzll(x)
functions, which return the number of leading zero bits in x,
cf. [1]. By a simple subtraction, we can derive the equivalences
below:

* For fls:
  Aside from the x = 0 special case, fls(x) is equivalent to
  BITS_PER_TYPE(x) - __builtin_clz(x).

* For fls64:
  Aside from the x = 0 special case, fls64(x) is equivalent to
  BITS_PER_TYPE(x) - __builtin_clzll(x).
  __builtin_clzll() takes an unsigned long long argument instead of a
  u64, which is fine because those two types are equivalent.
  Regardless, a BUILD_BUG_ON() safety net is added to formally assert
  that u64 and unsigned long long are the same type.

When used on constant expressions, only the builtin version can be
folded by the compiler (cf. [2]). However, for non-constant
expressions, the kernel's inline assembly results in better code on
both GCC and clang.

Use __builtin_constant_p() to select between the kernel's
fls()/fls64() and __builtin_clz()/__builtin_clzll() depending on
whether the argument is constant or not.

While renaming fls64() to variable_fls64(), change the argument type
from __u64 to u64 because we are not in a uapi header.

[1] Built-in Functions Provided by GCC:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <bp@suse.de>
CC: Nick Desaulniers <ndesaulniers@google.com>
CC: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 56 ++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a31453d7686d..d10e774af343 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -333,18 +333,15 @@ static __always_inline int variable_ffs(int x)
  */
 #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
 
-/**
- * fls - find last set bit in word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffs, but returns the position of the most significant set bit.
- *
- * fls(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 32.
- */
-static __always_inline int fls(unsigned int x)
+static __always_inline int constant_fls(unsigned int x)
+{
+	if (!x)
+		return 0;
+
+	return BITS_PER_TYPE(x) - __builtin_clz(x);
+}
+
+static __always_inline int variable_fls(unsigned int x)
 {
 	int r;
 
@@ -375,18 +372,29 @@ static __always_inline int fls(unsigned int x)
 }
 
 /**
- * fls64 - find last set bit in a 64-bit word
+ * fls - find last set bit in word
  * @x: the word to search
  *
  * This is defined in a similar way as the libc and compiler builtin
- * ffsll, but returns the position of the most significant set bit.
+ * ffs, but returns the position of the most significant set bit.
  *
- * fls64(value) returns 0 if value is 0 or the position of the last
+ * fls(value) returns 0 if value is 0 or the position of the last
  * set bit if value is nonzero. The last (most significant) bit is
- * at position 64.
+ * at position 32.
  */
+#define fls(x) (__builtin_constant_p(x) ? constant_fls(x) : variable_fls(x))
+
 #ifdef CONFIG_X86_64
-static __always_inline int fls64(__u64 x)
+static __always_inline int constant_fls64(u64 x)
+{
+	if (!x)
+		return 0;
+
+	BUILD_BUG_ON(!__same_type(x, unsigned long long));
+	return BITS_PER_TYPE(x) - __builtin_clzll(x);
+}
+
+static __always_inline int variable_fls64(u64 x)
 {
 	int bitpos = -1;
 	/*
@@ -399,6 +407,20 @@ static __always_inline int fls64(__u64 x)
 	    : "rm" (x));
 	return bitpos + 1;
 }
+
+/**
+ * fls64 - find last set bit in a 64-bit word
+ * @x: the word to search
+ *
+ * This is defined in a similar way as the libc and compiler builtin
+ * ffsll, but returns the position of the most significant set bit.
+ *
+ * fls64(value) returns 0 if value is 0 or the position of the last
+ * set bit if value is nonzero. The last (most significant) bit is
+ * at position 64.
+ */
+#define fls64(x) \
+	(__builtin_constant_p(x) ? constant_fls64(x) : variable_fls64(x))
 #else
 #include <asm-generic/bitops/fls64.h>
 #endif
-- 
2.37.4



Thread overview: 14+ messages
2022-11-06  9:51 [PATCH v1 0/2] x86/asm/bitops: optimize fls functions for constant expressions Vincent Mailhol
2022-11-06  9:51 ` [PATCH v1 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
2022-11-07  9:38   ` Peter Zijlstra
2022-11-07 12:19     ` Vincent MAILHOL
2022-11-10 19:20     ` Nick Desaulniers
2022-11-10 19:06   ` Nick Desaulniers
2022-11-06  9:51 ` [PATCH v1 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
2022-11-10 19:01   ` Nick Desaulniers
2022-11-11  1:57     ` Vincent MAILHOL
2022-11-11  3:36       ` Yury Norov
2022-11-11  7:02         ` Vincent MAILHOL
2022-11-25  2:33 ` [PATCH v2 0/2] x86/asm/bitops: optimize fls functions for " Vincent Mailhol
2022-11-25  2:33   ` [PATCH v2 1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation Vincent Mailhol
2022-11-25  2:33   ` [PATCH v2 2/2] x86/asm/bitops: Use __builtin_clz*() to evaluate constant expressions Vincent Mailhol
