* [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
@ 2022-05-10 14:25 Vincent Mailhol
  2022-05-10 14:25 ` [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Vincent Mailhol @ 2022-05-10 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor,
      Nick Desaulniers, Tom Rix, linux-kernel, llvm, Vincent Mailhol

The compilers provide builtin expressions equivalent to the ffs(),
__ffs() and ffz() functions of the kernel. The kernel uses optimized
assembly which produces better code than the builtin functions.
However, such assembly code cannot be optimized when used on constant
expressions.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use the compiler's __builtin function for constant expressions.

I also think that fls() and fls64() can be optimized in a similar way,
using __builtin_clz() and __builtin_clzll(), but it is a bit less
trivial so I want to focus on this series first. If it gets accepted,
I will then work on those two additional functions.

** Statistics **

On an allyesconfig, before applying this series, I get:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

After applying this series:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the calls to either ffs() or __ffs() were using
constant expressions and could be optimized (I did not produce the
figures for ffz()).
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 25 deletions(-)

-- 
2.35.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions
  2022-05-10 14:25 [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
@ 2022-05-10 14:25 ` Vincent Mailhol
  2022-05-10 22:29   ` Nick Desaulniers
  2022-05-10 14:25 ` [PATCH 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
  2022-05-10 22:14 ` [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
  2 siblings, 1 reply; 9+ messages in thread
From: Vincent Mailhol @ 2022-05-10 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor,
      Nick Desaulniers, Tom Rix, linux-kernel, llvm, Vincent Mailhol

For x86_64, the current ffs() implementation does not produce
optimized code when called with a constant expression. On the
contrary, the __builtin_ffs() function of both GCC and clang is able
to simplify the expression into a single instruction.

* Example *

Let's consider two dummy functions foo() and bar() as below:

| #include <linux/bitops.h>
| #define CONST 0x01000000
|
| unsigned int foo(void)
| {
|	return ffs(CONST);
| }
|
| unsigned int bar(void)
| {
|	return __builtin_ffs(CONST);
| }

GCC would produce the below assembly code:

| 0000000000000000 <foo>:
|    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    5:	0f bc c7             	bsf    %edi,%eax
|    8:	83 c0 01             	add    $0x1,%eax
|    b:	c3                   	ret
|    c:	0f 1f 40 00          	nopl   0x0(%rax)
|
| 0000000000000010 <bar>:
|   10:	b8 19 00 00 00       	mov    $0x19,%eax
|   15:	c3                   	ret

And clang would produce:

| 0000000000000000 <foo>:
|    0:	55                   	push   %rbp
|    1:	48 89 e5             	mov    %rsp,%rbp
|    4:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
|    9:	0f bc 05 00 00 00 00 	bsf    0x0(%rip),%eax        # 10 <foo+0x10>
|   10:	ff c0                	inc    %eax
|   12:	5d                   	pop    %rbp
|   13:	c3                   	ret
|   14:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
|   1b:	00 00 00
|   1e:	66 90                	xchg   %ax,%ax
|
| 0000000000000020 <bar>:
|   20:	55                   	push   %rbp
|   21:	48 89 e5             	mov    %rsp,%rbp
|   24:	b8 19 00 00 00       	mov    $0x19,%eax
|   29:	5d                   	pop    %rbp
|   2a:	c3                   	ret

For both examples, we clearly see the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.

However, for non constant expressions, the ffs() asm version of the
kernel remains better for x86_64 because, contrary to GCC, it does not
emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
able to optimize out the CMOV call).

This patch uses __builtin_constant_p() to select between the kernel's
ffs() and __builtin_ffs() depending on whether the argument is
constant or not.

As a side benefit, this patch also removes the below -Wshadow warning:

| ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
|   283 | static __always_inline int ffs(int x)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
    http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index a288ecd230ab..535a7a358c14 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
 #undef ADDR
 
 #ifdef __KERNEL__
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-static __always_inline int ffs(int x)
+static __always_inline int __ffs_asm(int x)
 {
 	int r;
 
@@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
 	return r + 1;
 }
 
+/**
+ * ffs - find first set bit in word
+ * @x: the word to search
+ *
+ * This is defined the same way as the libc and compiler builtin ffs
+ * routines, therefore differs in spirit from the other bitops.
+ *
+ * ffs(value) returns 0 if value is 0 or the position of the first
+ * set bit if value is nonzero. The first (least significant) bit
+ * is at position 1.
+ */
+#define ffs(x)				\
+	(__builtin_constant_p(x) ?	\
+	 __builtin_ffs(x) :		\
+	 __ffs_asm(x))
+
 /**
  * fls - find last set bit in word
  * @x: the word to search
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions 2022-05-10 14:25 ` [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol @ 2022-05-10 22:29 ` Nick Desaulniers 2022-05-10 23:54 ` Vincent MAILHOL 0 siblings, 1 reply; 9+ messages in thread From: Nick Desaulniers @ 2022-05-10 22:29 UTC (permalink / raw) To: Vincent Mailhol Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol <mailhol.vincent@wanadoo.fr> wrote: > > For x86_64, the current ffs() implementation does not produce > optimized code when called with a constant expression. On the > contrary, the __builtin_ffs() function of both GCC and clang is able > to simplify the expression into a single instruction. > > * Example * > > Let's consider two dummy functions foo() and bar() as below: > > | #include <linux/bitops.h> > | #define CONST 0x01000000 > | > | unsigned int foo(void) > | { > | return ffs(CONST); > | } > | > | unsigned int bar(void) > | { > | return __builtin_ffs(CONST); > | } > > GCC would produce below assembly code: > > | 0000000000000000 <foo>: > | 0: b8 ff ff ff ff mov $0xffffffff,%eax > | 5: 0f bc c7 bsf %edi,%eax > | 8: 83 c0 01 add $0x1,%eax > | b: c3 ret > | c: 0f 1f 40 00 nopl 0x0(%rax) > | > | 0000000000000010 <bar>: > | 10: b8 19 00 00 00 mov $0x19,%eax > | 15: c3 ret > > And clang would produce: > > | 0000000000000000 <foo>: > | 0: 55 push %rbp > | 1: 48 89 e5 mov %rsp,%rbp > | 4: b8 ff ff ff ff mov $0xffffffff,%eax > | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10> > | 10: ff c0 inc %eax > | 12: 5d pop %rbp > | 13: c3 ret > | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > | 1b: 00 00 00 > | 1e: 66 90 xchg %ax,%ax > | > | 0000000000000020 <bar>: > | 20: 55 push %rbp > | 21: 48 89 e5 mov %rsp,%rbp > | 24: b8 19 00 00 00 mov $0x19,%eax > | 29: 5d pop %rbp > | 2a: c3 ret 
Right, we need to allocate registers to move the inputs into the asm
block, and the results back out. Inline asm is analogous to a call
with a custom calling convention, where we don't look into the body of
the inline asm.

Does -fomit-frame-pointer make these snippets clearer, or did you not
build with -O2? Consider using those flags if so, since we generally
prefer the ORC unwinder on x86, not the frame pointer unwinder. If the
compilers are forcing a frame pointer when using the builtins once
optimizations are enabled, that's a problem (that we've seen in the
past with the builtins for reading eflags with clang; now fixed).

>
> For both examples, we clearly see the benefit of using __builtin_ffs()
> instead of the kernel's asm implementation for constant expressions.
>
> However, for non constant expressions, the ffs() asm version of the
> kernel remains better for x86_64 because, contrary to GCC, it doesn't
> emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is
> able optimize out the CMOV call).
>
> This patch uses the __builtin_constant_p() to select between the
> kernel's ffs() and the __builtin_ffs() depending on whether the
> argument is constant or not.
>
>
> As a side benefit, this patch also removes below -Wshadow warning:
>
> | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
> |   283 | static __always_inline int ffs(int x)

Nice! :)

>
> [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
> http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk

+ David, author of ca3d30cc02f7. I was wondering if this applied to
more than just x86, but I see now that some architectures just include
include/asm-generic/bitops/builtin-ffs.h into their
arch/*/include/asm/bitops.h. It's only when we want to beat the
compiler for non-ICE expressions.

Patch LGTM; just minor comments on commit message, naming, and formatting.
> > > Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> > --- > arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------ > 1 file changed, 17 insertions(+), 12 deletions(-) > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h > index a288ecd230ab..535a7a358c14 100644 > --- a/arch/x86/include/asm/bitops.h > +++ b/arch/x86/include/asm/bitops.h > @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word) > #undef ADDR > > #ifdef __KERNEL__ > -/** > - * ffs - find first set bit in word > - * @x: the word to search > - * > - * This is defined the same way as the libc and compiler builtin ffs > - * routines, therefore differs in spirit from the other bitops. > - * > - * ffs(value) returns 0 if value is 0 or the position of the first > - * set bit if value is nonzero. The first (least significant) bit > - * is at position 1. > - */ > -static __always_inline int ffs(int x) > +static __always_inline int __ffs_asm(int x) How about variable_ffs rather than __ffs_asm? Let's try to stick with the convention used by test_bit? > { > int r; > > @@ -310,6 +299,22 @@ static __always_inline int ffs(int x) > return r + 1; > } > > +/** > + * ffs - find first set bit in word > + * @x: the word to search > + * > + * This is defined the same way as the libc and compiler builtin ffs > + * routines, therefore differs in spirit from the other bitops. > + * > + * ffs(value) returns 0 if value is 0 or the position of the first > + * set bit if value is nonzero. The first (least significant) bit > + * is at position 1. > + */ > +#define ffs(x) \ > + (__builtin_constant_p(x) ? \ > + __builtin_ffs(x) : \ > + __ffs_asm(x)) > + I think this whole #define can fit on one line? If not, perhaps the BCP can start on the initial line? Otherwise it looks like the then/else clauses are indented by 1 tab followed by 1 space. Consider just using tabs. 
> /**
>  * fls - find last set bit in word
>  * @x: the word to search
> --
> 2.35.1
>

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions 2022-05-10 22:29 ` Nick Desaulniers @ 2022-05-10 23:54 ` Vincent MAILHOL 2022-05-11 15:38 ` Vincent MAILHOL 0 siblings, 1 reply; 9+ messages in thread From: Vincent MAILHOL @ 2022-05-10 23:54 UTC (permalink / raw) To: Nick Desaulniers Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells On Wed. 11 May 2022 at 07:29, Nick Desaulniers <ndesaulniers@google.com> wrote: > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol > <mailhol.vincent@wanadoo.fr> wrote: > > > > For x86_64, the current ffs() implementation does not produce > > optimized code when called with a constant expression. On the > > contrary, the __builtin_ffs() function of both GCC and clang is able > > to simplify the expression into a single instruction. > > > > * Example * > > > > Let's consider two dummy functions foo() and bar() as below: > > > > | #include <linux/bitops.h> > > | #define CONST 0x01000000 > > | > > | unsigned int foo(void) > > | { > > | return ffs(CONST); > > | } > > | > > | unsigned int bar(void) > > | { > > | return __builtin_ffs(CONST); > > | } > > > > GCC would produce below assembly code: > > > > | 0000000000000000 <foo>: > > | 0: b8 ff ff ff ff mov $0xffffffff,%eax > > | 5: 0f bc c7 bsf %edi,%eax > > | 8: 83 c0 01 add $0x1,%eax > > | b: c3 ret > > | c: 0f 1f 40 00 nopl 0x0(%rax) > > | > > | 0000000000000010 <bar>: > > | 10: b8 19 00 00 00 mov $0x19,%eax > > | 15: c3 ret > > > > And clang would produce: > > > > | 0000000000000000 <foo>: > > | 0: 55 push %rbp > > | 1: 48 89 e5 mov %rsp,%rbp > > | 4: b8 ff ff ff ff mov $0xffffffff,%eax > > | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10> > > | 10: ff c0 inc %eax > > | 12: 5d pop %rbp > > | 13: c3 ret > > | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > > | 1b: 00 00 00 > > | 1e: 66 90 xchg %ax,%ax > > | > > | 0000000000000020 <bar>: > > 
| 20: 55 push %rbp > > | 21: 48 89 e5 mov %rsp,%rbp > > | 24: b8 19 00 00 00 mov $0x19,%eax > > | 29: 5d pop %rbp > > | 2a: c3 ret > > Right, we need to allocate registers to move the inputs into the asm > block, and the results back out. Inline asm is analogous to a call > with a custom calling convention, where we don't look into the body of > the inline asm. > > Does -fomit-frame-pointer clean make these snippets clearer, or did > you not build with -O2? Consider using those flags if so, since we > generally prefer the ORC unwinder on x86, not the frame pointer > unwinder. If the compilers are forcing a frame pointer when using the > builtins once optimizations are enabled, that's a problem (that we've > seen in the past with the builtins for reading eflags with clang; now > fixed). I have not played with those parameters yet, so short answer, I am using the kernel default (above assembly was compiled with Kbuild). You are touching a few topics I am not familiar with, I need some research on this before answering you in more detail. > > > > For both examples, we clearly see the benefit of using __builtin_ffs() > > instead of the kernel's asm implementation for constant expressions. > > > > However, for non constant expressions, the ffs() asm version of the > > kernel remains better for x86_64 because, contrary to GCC, it doesn't > > emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is > > able optimize out the CMOV call). > > > > This patch uses the __builtin_constant_p() to select between the > > kernel's ffs() and the __builtin_ffs() depending on whether the > > argument is constant or not. > > > > > > As a side benefit, this patch also removes below -Wshadow warning: > > > > | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow] > > | 283 | static __always_inline int ffs(int x) > > Nice! :) Thanks. For the record, fixing the -Wshadow is my real motivation. 
I am pissed when the header files through some W=12 warnings. Once this patch is applied, there will be one last annoying W=2 warning to clear in order to only see local warnings and not random spam from headers when doing a W=12 (at least on x86_64). But because those kinds of W=2 fixes aren't so popular, I figured it would be better to offer something else. I first checked if GCC produces less optimized code than the kernel assembly: that was still the case. I then looked at the GCC mailing list to see if discussion on this topic existed. Didn't find it but found instead that GCC could optimize constant expressions. And voilà how I came to the creation of this patch. > > > > [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()") > > http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk > > + David, author of ca3d30cc02f7. I was wondering if this applied to > more than just x86, but I see now that some architectures just include > include/asm-generic/bitops/builtin-ffs.h into their > arch/*/include/asm/bitops.h. It's only when we want to beat the > compiler for non-ICE expressions. Yes, I did a quick research, the majority of the architectures already rely on the builtin function. Would need to give a deeper look to track if anyone else other than x86 also uses assembly. Also, this potentially may apply to builtin functions other than the ffs family. Just did not find any other cases so far. > Patch LGTM; just minor comments on commit message, naming, and formatting. 
> > > > > > > Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> > > --- > > arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------ > > 1 file changed, 17 insertions(+), 12 deletions(-) > > > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h > > index a288ecd230ab..535a7a358c14 100644 > > --- a/arch/x86/include/asm/bitops.h > > +++ b/arch/x86/include/asm/bitops.h > > @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word) > > #undef ADDR > > > > #ifdef __KERNEL__ > > -/** > > - * ffs - find first set bit in word > > - * @x: the word to search > > - * > > - * This is defined the same way as the libc and compiler builtin ffs > > - * routines, therefore differs in spirit from the other bitops. > > - * > > - * ffs(value) returns 0 if value is 0 or the position of the first > > - * set bit if value is nonzero. The first (least significant) bit > > - * is at position 1. > > - */ > > -static __always_inline int ffs(int x) > > +static __always_inline int __ffs_asm(int x) > > How about variable_ffs rather than __ffs_asm? Let's try to stick with > the convention used by test_bit? ACK. I will also follow this comment for path 2/2 and use __variable_ffs instead of __ffs_asm_not_zero there. > > { > > int r; > > > > @@ -310,6 +299,22 @@ static __always_inline int ffs(int x) > > return r + 1; > > } > > > > +/** > > + * ffs - find first set bit in word > > + * @x: the word to search > > + * > > + * This is defined the same way as the libc and compiler builtin ffs > > + * routines, therefore differs in spirit from the other bitops. > > + * > > + * ffs(value) returns 0 if value is 0 or the position of the first > > + * set bit if value is nonzero. The first (least significant) bit > > + * is at position 1. > > + */ > > +#define ffs(x) \ > > + (__builtin_constant_p(x) ? \ > > + __builtin_ffs(x) : \ > > + __ffs_asm(x)) > > + > > I think this whole #define can fit on one line? 
I split it into multiple lines to be consistent with other macros in the file. I have no objections to doing it on a single line (I will just check if this is within the 80 characters limit and if so, will do a single line in v2). > If not, perhaps the > BCP can start on the initial line? Otherwise it looks like the > then/else clauses are indented by 1 tab followed by 1 space. Consider > just using tabs. Right, I inadvertently added a space after the tab of the first line and the editor auto indentation repeated the patterns on the other lines. With that said, I will prepare the v2. I will send it within two days I think (can not do it right now). > > /** > > * fls - find last set bit in word > > * @x: the word to search > > -- > > 2.35.1 > > > > > -- > Thanks, > ~Nick Desaulniers ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions 2022-05-10 23:54 ` Vincent MAILHOL @ 2022-05-11 15:38 ` Vincent MAILHOL 0 siblings, 0 replies; 9+ messages in thread From: Vincent MAILHOL @ 2022-05-11 15:38 UTC (permalink / raw) To: Nick Desaulniers Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm, David Howells On Wed. 11 May 2022 at 08:54, Vincent MAILHOL <mailhol.vincent@wanadoo.fr> wrote: > On Wed. 11 May 2022 at 07:29, Nick Desaulniers <ndesaulniers@google.com> wrote: > > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol > > <mailhol.vincent@wanadoo.fr> wrote: > > > > > > For x86_64, the current ffs() implementation does not produce > > > optimized code when called with a constant expression. On the > > > contrary, the __builtin_ffs() function of both GCC and clang is able > > > to simplify the expression into a single instruction. > > > > > > * Example * > > > > > > Let's consider two dummy functions foo() and bar() as below: > > > > > > | #include <linux/bitops.h> > > > | #define CONST 0x01000000 > > > | > > > | unsigned int foo(void) > > > | { > > > | return ffs(CONST); > > > | } > > > | > > > | unsigned int bar(void) > > > | { > > > | return __builtin_ffs(CONST); > > > | } > > > > > > GCC would produce below assembly code: > > > > > > | 0000000000000000 <foo>: > > > | 0: b8 ff ff ff ff mov $0xffffffff,%eax > > > | 5: 0f bc c7 bsf %edi,%eax > > > | 8: 83 c0 01 add $0x1,%eax > > > | b: c3 ret > > > | c: 0f 1f 40 00 nopl 0x0(%rax) > > > | > > > | 0000000000000010 <bar>: > > > | 10: b8 19 00 00 00 mov $0x19,%eax > > > | 15: c3 ret > > > > > > And clang would produce: > > > > > > | 0000000000000000 <foo>: > > > | 0: 55 push %rbp > > > | 1: 48 89 e5 mov %rsp,%rbp > > > | 4: b8 ff ff ff ff mov $0xffffffff,%eax > > > | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10> > > > | 10: ff c0 inc %eax > > > | 12: 5d pop %rbp > > > | 13: c3 
ret > > > | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > > > | 1b: 00 00 00 > > > | 1e: 66 90 xchg %ax,%ax > > > | > > > | 0000000000000020 <bar>: > > > | 20: 55 push %rbp > > > | 21: 48 89 e5 mov %rsp,%rbp > > > | 24: b8 19 00 00 00 mov $0x19,%eax > > > | 29: 5d pop %rbp > > > | 2a: c3 ret > > > > Right, we need to allocate registers to move the inputs into the asm > > block, and the results back out. Inline asm is analogous to a call > > with a custom calling convention, where we don't look into the body of > > the inline asm. > > > > Does -fomit-frame-pointer clean make these snippets clearer, or did > > you not build with -O2? Consider using those flags if so, since we > > generally prefer the ORC unwinder on x86, not the frame pointer > > unwinder. If the compilers are forcing a frame pointer when using the > > builtins once optimizations are enabled, that's a problem (that we've > > seen in the past with the builtins for reading eflags with clang; now > > fixed). > > I have not played with those parameters yet, so short answer, I > am using the kernel default (above assembly was compiled with > Kbuild). You are touching a few topics I am not familiar with, I > need some research on this before answering you in more detail. I got the answer: actually, I was using an allnoconfig when I generated the above assembly code. And it appears that allnoconfig has -fno-omit-frame-pointer by default. I will activate the ORC unwinder and update the snippets in v2. > > > > > > For both examples, we clearly see the benefit of using __builtin_ffs() > > > instead of the kernel's asm implementation for constant expressions. > > > > > > However, for non constant expressions, the ffs() asm version of the > > > kernel remains better for x86_64 because, contrary to GCC, it doesn't > > > emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is > > > able optimize out the CMOV call). 
> > > > > > This patch uses the __builtin_constant_p() to select between the > > > kernel's ffs() and the __builtin_ffs() depending on whether the > > > argument is constant or not. > > > > > > > > > As a side benefit, this patch also removes below -Wshadow warning: > > > > > > | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow] > > > | 283 | static __always_inline int ffs(int x) > > > > Nice! :) > > Thanks. > > For the record, fixing the -Wshadow is my real motivation. I am > pissed when the header files through some W=12 warnings. Once > this patch is applied, there will be one last annoying W=2 > warning to clear in order to only see local warnings and not > random spam from headers when doing a W=12 (at least on x86_64). > > But because those kinds of W=2 fixes aren't so popular, I figured > it would be better to offer something else. I first checked if > GCC produces less optimized code than the kernel assembly: that > was still the case. I then looked at the GCC mailing list to see > if discussion on this topic existed. Didn't find it but found > instead that GCC could optimize constant expressions. And voilà > how I came to the creation of this patch. > > > > > > > [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()") > > > http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk > > > > + David, author of ca3d30cc02f7. I was wondering if this applied to > > more than just x86, but I see now that some architectures just include > > include/asm-generic/bitops/builtin-ffs.h into their > > arch/*/include/asm/bitops.h. It's only when we want to beat the > > compiler for non-ICE expressions. > > Yes, I did a quick research, the majority of the architectures already > rely on the builtin function. > Would need to give a deeper look to track if anyone else other than > x86 also uses assembly. 
> > Also, this potentially may apply to builtin functions other than > the ffs family. Just did not find any other cases so far. > > > Patch LGTM; just minor comments on commit message, naming, and formatting. > > > > > > > > > > > Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> > > > --- > > > arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------ > > > 1 file changed, 17 insertions(+), 12 deletions(-) > > > > > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h > > > index a288ecd230ab..535a7a358c14 100644 > > > --- a/arch/x86/include/asm/bitops.h > > > +++ b/arch/x86/include/asm/bitops.h > > > @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word) > > > #undef ADDR > > > > > > #ifdef __KERNEL__ > > > -/** > > > - * ffs - find first set bit in word > > > - * @x: the word to search > > > - * > > > - * This is defined the same way as the libc and compiler builtin ffs > > > - * routines, therefore differs in spirit from the other bitops. > > > - * > > > - * ffs(value) returns 0 if value is 0 or the position of the first > > > - * set bit if value is nonzero. The first (least significant) bit > > > - * is at position 1. > > > - */ > > > -static __always_inline int ffs(int x) > > > +static __always_inline int __ffs_asm(int x) > > > > How about variable_ffs rather than __ffs_asm? Let's try to stick with > > the convention used by test_bit? > > ACK. I will also follow this comment for path 2/2 and use > __variable_ffs instead of __ffs_asm_not_zero there. > > > > { > > > int r; > > > > > > @@ -310,6 +299,22 @@ static __always_inline int ffs(int x) > > > return r + 1; > > > } > > > > > > +/** > > > + * ffs - find first set bit in word > > > + * @x: the word to search > > > + * > > > + * This is defined the same way as the libc and compiler builtin ffs > > > + * routines, therefore differs in spirit from the other bitops. 
> > > + * > > > + * ffs(value) returns 0 if value is 0 or the position of the first > > > + * set bit if value is nonzero. The first (least significant) bit > > > + * is at position 1. > > > + */ > > > +#define ffs(x) \ > > > + (__builtin_constant_p(x) ? \ > > > + __builtin_ffs(x) : \ > > > + __ffs_asm(x)) > > > + > > > > I think this whole #define can fit on one line? > > I split it into multiple lines to be consistent with other macros > in the file. I have no objections to doing it on a single line (I > will just check if this is within the 80 characters limit and if > so, will do a single line in v2). > > > If not, perhaps the > > BCP can start on the initial line? Otherwise it looks like the > > then/else clauses are indented by 1 tab followed by 1 space. Consider > > just using tabs. > > Right, I inadvertently added a space after the tab of the first > line and the editor auto indentation repeated the patterns on the > other lines. > > With that said, I will prepare the v2. I will send it within two > days I think (can not do it right now). > > > > /** > > > * fls - find last set bit in word > > > * @x: the word to search > > > -- > > > 2.35.1 > > > > > > > > > -- > > Thanks, > > ~Nick Desaulniers ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions
  2022-05-10 14:25 [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
  2022-05-10 14:25 ` [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
@ 2022-05-10 14:25 ` Vincent Mailhol
  2022-05-10 22:14 ` [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
  2 siblings, 0 replies; 9+ messages in thread
From: Vincent Mailhol @ 2022-05-10 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Dave Hansen, x86, H . Peter Anvin, Nathan Chancellor,
      Nick Desaulniers, Tom Rix, linux-kernel, llvm, Vincent Mailhol

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builtin_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() gets simplified into a single
instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remain slightly better than the code produced
by GCC (it produces a useless instruction to clear eax).

This patch uses __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and __builtin_ctzl() depending on whether the argument
is constant or not.

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
Usually, helper functions are prefixed by two underscores. But here,
__ffs() is already prefixed, so instead of creating a ____ffs(), I
renamed it to __ffs_asm_not_zero(). The _not_zero() suffix is to
differentiate it from the __ffs_asm() helper function of the previous
patch. If someone has better inspiration for the naming of this patch
and the previous one, please let me know.
---
 arch/x86/include/asm/bitops.h | 36 ++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 535a7a358c14..694d0d100399 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs_asm_not_zero(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -239,12 +233,17 @@ static __always_inline unsigned long __ffs(unsigned long word)
 }
 
 /**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 __ffs_asm_not_zero(word))
+
+static __always_inline unsigned long __ffz_asm(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 __ffz_asm(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-10 14:25 [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
  2022-05-10 14:25 ` [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
  2022-05-10 14:25 ` [PATCH 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
@ 2022-05-10 22:14 ` Nick Desaulniers
  2022-05-10 23:24   ` Vincent MAILHOL
  2 siblings, 1 reply; 9+ messages in thread
From: Nick Desaulniers @ 2022-05-10 22:14 UTC (permalink / raw)
To: Vincent Mailhol
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm

On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> The compilers provides some builtin expression equivalent to the
> ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> optimized assembly which produces better code than the builtin
> functions. However, such assembly code can not be optimized when used
> on constant expression.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
>   * use kernel assembly for non constant expressions
>
>   * use compiler's __builtin function for constant expressions.
>
> I also think that the fls() and fls64() can be optimized in a similar
> way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> trivial so I want to focus on this series first. If it get accepted, I
> will then work on those two additionnal function.
>
>
> ** Statistics **
>
> On a allyesconfig, before applying this series, I get:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 1081
>
> After applying this series:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 792
>
> So, roughly 26.7% of the call to either ffs() or __ffs() were using
> constant expression and can be optimized (I did not produce the
> figures for ffz()).
These stats are interesting; consider putting them on patch 1/2 commit
message though (in addition to the cover letter). (Sending thoughts on
1/2 next).

>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Here's the same measure of x86_64 allyesconfig (./scripts/config -d
CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
LLVM (~clang-15):

Before:
$ objdump -d vmlinux.o | grep bsf | wc -l
1454

After:
$ objdump -d vmlinux.o | grep bsf | wc -l
1070

-26.4% :)

>
>
> Vincent Mailhol (2):
>   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
>     expressions
>   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
>     expressions
>
>  arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
>  1 file changed, 40 insertions(+), 25 deletions(-)
>
> --
> 2.35.1
>

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-10 22:14 ` [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
@ 2022-05-10 23:24 ` Vincent MAILHOL
  2022-05-11 14:46   ` Vincent MAILHOL
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent MAILHOL @ 2022-05-10 23:24 UTC (permalink / raw)
To: Nick Desaulniers
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm

On Wed. 11 May 2022 at 07:14, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > The compilers provides some builtin expression equivalent to the
> > ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> > optimized assembly which produces better code than the builtin
> > functions. However, such assembly code can not be optimized when used
> > on constant expression.
> >
> > This series relies on __builtin_constant_p to select the optimal solution:
> >
> >   * use kernel assembly for non constant expressions
> >
> >   * use compiler's __builtin function for constant expressions.
> >
> > I also think that the fls() and fls64() can be optimized in a similar
> > way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> > trivial so I want to focus on this series first. If it get accepted, I
> > will then work on those two additionnal function.
> >
> >
> > ** Statistics **
> >
> > On a allyesconfig, before applying this series, I get:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 1081
> >
> > After applying this series:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 792
> >
> > So, roughly 26.7% of the call to either ffs() or __ffs() were using
> > constant expression and can be optimized (I did not produce the
> > figures for ffz()).
>
> These stats are interesting; consider putting them on patch 1/2 commit
> message though (in addition to the cover letter). (Sending thoughts on
> 1/2 next).

The fact is that patch 1/2 changes ffs() and patch 2/2 changes
__ffs(). For v2, I will run the stats on each patch separately in
order not to mix the results.

> >
> > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> LLVM (~clang-15):
>
> Before:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1454
>
> After:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1070
>
> -26.4% :)

Roughly the same ratio. I am just surprised that the absolute numbers
are different:

  * GCC: before 1081, after 792
  * clang: before 1454, after 1070

I wonder why clang produces more bsf instructions than GCC?

Also, on a side note, I am not the first one to realize that
__builtin_ffs() is able to optimize constant expressions. Some people
have already used it locally:

| $ git grep __builtin_ffs | wc -l
| 80

> >
> >
> > Vincent Mailhol (2):
> >   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> >     expressions
> >   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> >     expressions
> >
> >  arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
> >  1 file changed, 40 insertions(+), 25 deletions(-)
> >
> > --
> > 2.35.1
> >
>
> --
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions
  2022-05-10 23:24 ` Vincent MAILHOL
@ 2022-05-11 14:46 ` Vincent MAILHOL
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent MAILHOL @ 2022-05-11 14:46 UTC (permalink / raw)
To: Nick Desaulniers
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Nathan Chancellor, Tom Rix, linux-kernel, llvm

On Wed. 11 May 2022 at 08:24, Vincent MAILHOL
<mailhol.vincent@wanadoo.fr> wrote:
> On Wed. 11 May 2022 at 07:14, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> > <mailhol.vincent@wanadoo.fr> wrote:
> > >
> > > The compilers provides some builtin expression equivalent to the
> > > ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> > > optimized assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be optimized when used
> > > on constant expression.
> > >
> > > This series relies on __builtin_constant_p to select the optimal solution:
> > >
> > >   * use kernel assembly for non constant expressions
> > >
> > >   * use compiler's __builtin function for constant expressions.
> > >
> > > I also think that the fls() and fls64() can be optimized in a similar
> > > way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> > > trivial so I want to focus on this series first. If it get accepted, I
> > > will then work on those two additionnal function.
> > >
> > >
> > > ** Statistics **
> > >
> > > On a allyesconfig, before applying this series, I get:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 1081
> > >
> > > After applying this series:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 792
> > >
> > > So, roughly 26.7% of the call to either ffs() or __ffs() were using
> > > constant expression and can be optimized (I did not produce the
> > > figures for ffz()).
> >
> > These stats are interesting; consider putting them on patch 1/2 commit
> > message though (in addition to the cover letter). (Sending thoughts on
> > 1/2 next).
>
> The fact is that patch 1/2 changes ffs() and patch 2/2 changes
> __ffs(). For v2, I will run the stats on each patch separately in
> order not to mix the results.
>
> > >
> > > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
> >
> > Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> > CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> > LLVM (~clang-15):
> >
> > Before:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1454
> >
> > After:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1070
> >
> > -26.4% :)
>
> Roughly the same ratio. I am just surprised that the absolute numbers
> are different:
>
>   * GCC: before 1081, after 792
>   * clang: before 1454, after 1070
>
> I wonder why clang produces more bsf instructions than GCC?

I did not find the answer yet, but while looking at this, I found
another interesting thing: on x86_64, the bsf instruction becomes
tzcnt when used with the rep prefix. So ffs() produces bsf assembly
instructions, but __ffs() and ffz() produce tzcnt.

c.f. http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

I will update the figures in v2 and benchmark both bsf and tzcnt.

> Also, on a side note, I am not the first one to realize that
> __builtin_ffs() is able to optimize constant expressions. Some people
> have already used it locally:
>
> | $ git grep __builtin_ffs | wc -l
> | 80

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-05-11 15:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-10 14:25 [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions Vincent Mailhol
2022-05-10 14:25 ` [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate " Vincent Mailhol
2022-05-10 22:29   ` Nick Desaulniers
2022-05-10 23:54     ` Vincent MAILHOL
2022-05-11 15:38       ` Vincent MAILHOL
2022-05-10 14:25 ` [PATCH 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl " Vincent Mailhol
2022-05-10 22:14 ` [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for " Nick Desaulniers
2022-05-10 23:24   ` Vincent MAILHOL
2022-05-11 14:46     ` Vincent MAILHOL