* [PATCH 0/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro @ 2022-05-25 14:40 Uros Bizjak 2022-05-25 14:40 ` [PATCH 1/2] " Uros Bizjak 2022-05-25 14:40 ` [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 Uros Bizjak 0 siblings, 2 replies; 18+ messages in thread From: Uros Bizjak @ 2022-05-25 14:40 UTC (permalink / raw) To: x86, linux-kernel Cc: Uros Bizjak, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Waiman.Long, paulmck This patch series introduces try_cmpxchg64 into CMPXCHG_LOOP macro to improve generated code a bit. x86 CMPXCHG instruction returns success in ZF flag, so a compare after CMPXCHG and a related move instruction can be eliminated. The second patch (optionally) enables lockless reference count updates for X86_32 target with X86_CMPXCHG64 config flag set. When try_cmpxchg64 is used in CMPXCHG_LOOP macro in lib/lockref.c, the compiler avoids double-word compare and related move and produces quite optimal code around CMPXCHG8B for a register starved X86_32 target. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: paulmck@linux.vnet.ibm.com --- Uros Bizjak (2): locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 arch/x86/Kconfig | 2 +- lib/lockref.c | 9 ++++----- 2 files changed, 5 insertions(+), 6 deletions(-) -- 2.35.3 ^ permalink raw reply [flat|nested] 18+ messages in thread
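[Editorial sketch] The difference between the two loop shapes described in the cover letter can be modeled in plain userspace C. This is a minimal sketch using C11 atomics as a stand-in for the kernel's cmpxchg64()/try_cmpxchg64() helpers; the model_* names are illustrative, not kernel API, and a plain 64-bit counter stands in for the real lockref's packed spinlock + count:

```c
#include <stdint.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of cmpxchg64(): returns the value previously found
 * at *ptr; the caller must compare it against the expected value to
 * detect success. */
static uint64_t model_cmpxchg64(_Atomic uint64_t *ptr, uint64_t old, uint64_t new)
{
	atomic_compare_exchange_strong(ptr, &old, new);
	return old;	/* on failure, the observed value was written here */
}

/* Old loop shape: an extra compare after the exchange decides success. */
static void get_old(_Atomic uint64_t *lockref)
{
	uint64_t old = atomic_load(lockref);
	for (;;) {
		uint64_t new = old + 1, prev = old;
		old = model_cmpxchg64(lockref, old, new);
		if (old == prev)	/* the compare the patch eliminates */
			return;
	}
}

/* New loop shape: success is a boolean and the expected value is
 * refreshed in place on failure; on x86 the boolean comes straight
 * from the ZF flag set by CMPXCHG, so no separate compare is needed. */
static void get_new(_Atomic uint64_t *lockref)
{
	uint64_t old = atomic_load(lockref);
	uint64_t new;

	do {
		new = old + 1;
	} while (!atomic_compare_exchange_strong(lockref, &old, new));
}
```

Both loops perform the same increment; the second merely lets the compiler consume the success flag directly, which is what shrinks the generated x86 code in the patches below.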
* [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
  2022-05-25 14:40 [PATCH 0/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Uros Bizjak
@ 2022-05-25 14:40 ` Uros Bizjak
  2022-05-25 16:47   ` Linus Torvalds
  2022-05-25 14:40 ` [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 Uros Bizjak
  1 sibling, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2022-05-25 14:40 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Waiman.Long, paulmck

Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
x86 CMPXCHG instruction returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg). The main loop of lockref_get improves from:

    13:	48 89 c1             	mov    %rax,%rcx
    16:	48 c1 f9 20          	sar    $0x20,%rcx
    1a:	83 c1 01             	add    $0x1,%ecx
    1d:	48 89 ce             	mov    %rcx,%rsi
    20:	89 c1                	mov    %eax,%ecx
    22:	48 89 d0             	mov    %rdx,%rax
    25:	48 c1 e6 20          	shl    $0x20,%rsi
    29:	48 09 f1             	or     %rsi,%rcx
    2c:	f0 48 0f b1 4d 00    	lock cmpxchg %rcx,0x0(%rbp)
    32:	48 39 d0             	cmp    %rdx,%rax
    35:	75 17                	jne    4e <lockref_get+0x4e>

to:

    13:	48 89 ca             	mov    %rcx,%rdx
    16:	48 c1 fa 20          	sar    $0x20,%rdx
    1a:	83 c2 01             	add    $0x1,%edx
    1d:	48 89 d6             	mov    %rdx,%rsi
    20:	89 ca                	mov    %ecx,%edx
    22:	48 c1 e6 20          	shl    $0x20,%rsi
    26:	48 09 f2             	or     %rsi,%rdx
    29:	f0 48 0f b1 55 00    	lock cmpxchg %rdx,0x0(%rbp)
    2f:	75 02                	jne    33 <lockref_get+0x33>

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
---
 lib/lockref.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 5b34bbd3eba8..c6f0b183b937 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -14,12 +14,11 @@
 	BUILD_BUG_ON(sizeof(old) != 8);					\
 	old.lock_count = READ_ONCE(lockref->lock_count);		\
 	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
-		struct lockref new = old, prev = old;			\
+		struct lockref new = old;				\
 		CODE							\
-		old.lock_count = cmpxchg64_relaxed(&lockref->lock_count, \
-						   old.lock_count,	\
-						   new.lock_count);	\
-		if (likely(old.lock_count == prev.lock_count)) {	\
+		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,	\
+						 &old.lock_count,	\
+						 new.lock_count))) {	\
 			SUCCESS;					\
 		}							\
 		if (!--retry)						\
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-25 14:40 ` [PATCH 1/2] " Uros Bizjak @ 2022-05-25 16:47 ` Linus Torvalds 2022-05-26 8:54 ` Uros Bizjak ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Linus Torvalds @ 2022-05-25 16:47 UTC (permalink / raw) To: Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Michael Ellerman, Thomas Bogendoerfer, Heiko Carstens Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > x86 CMPXCHG instruction returns success in ZF flag, so this > change saves a compare after cmpxchg (and related move instruction > in front of cmpxchg). The main loop of lockref_get improves from: Ack on this one regardless of the 32-bit x86 question. HOWEVER. I'd like other architectures to pipe up too, because I think right now x86 is the only one that implements that "arch_try_cmpxchg()" family of operations natively, and I think the generic fallback for when it is missing might be kind of nasty. Maybe it ends up generating ok code, but it's also possible that it just didn't matter when it was only used in one place in the scheduler. The lockref_get() case can be quite hot under some loads, it would be sad if this made other architectures worse. Anyway, maybe that try_cmpxchg() fallback is fine, and works out well on architectures that use load-locked / store-conditional as-is. But just to verify, I'm adding arm/powerpc/s390/mips people to the cc. See https://lore.kernel.org/all/20220525144013.6481-2-ubizjak@gmail.com/ for the original email and the x86-64 code example. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
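[Editorial sketch] The generic fallback Linus is worried about has roughly the following shape. This is a simplified userspace model (C11 atomics stand in for the kernel's cmpxchg64(); the model_* names are illustrative, not the actual kernel macros): on failure, the observed value is written back through the expected-value pointer, so a caller's retry loop does not need to re-read memory:

```c
#include <stdint.h>
#include <stdatomic.h>
#include <stdbool.h>

/* cmpxchg64() model: returns the value previously found at *ptr. */
static uint64_t model_cmpxchg64(_Atomic uint64_t *ptr, uint64_t old, uint64_t new)
{
	atomic_compare_exchange_strong(ptr, &old, new);
	return old;	/* on failure, the observed value ends up here */
}

/* Shape of the generic try_cmpxchg64() fallback: wrap cmpxchg64() and,
 * on failure, refresh the caller's expected value in place. */
static bool model_try_cmpxchg64(_Atomic uint64_t *ptr, uint64_t *oldp, uint64_t new)
{
	uint64_t old = *oldp;
	uint64_t r = model_cmpxchg64(ptr, old, new);

	if (r != old)
		*oldp = r;	/* caller retries with the fresh value */
	return r == old;
}
```

On a success path this compiles to the same sequence as open-coded cmpxchg plus compare; the question raised above is whether the failure-path write-back costs anything on LL/SC architectures.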
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-25 16:47 ` Linus Torvalds @ 2022-05-26 8:54 ` Uros Bizjak 2022-05-26 12:14 ` Michael Ellerman 2022-05-26 16:56 ` Linus Torvalds 2 siblings, 0 replies; 18+ messages in thread From: Uros Bizjak @ 2022-05-26 8:54 UTC (permalink / raw) To: Linus Torvalds Cc: Catalin Marinas, Will Deacon, Russell King, Michael Ellerman, Thomas Bogendoerfer, Heiko Carstens, the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Paul McKenney [-- Attachment #1: Type: text/plain, Size: 2593 bytes --] On Wed, May 25, 2022 at 6:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > > x86 CMPXCHG instruction returns success in ZF flag, so this > > change saves a compare after cmpxchg (and related move instruction > > in front of cmpxchg). The main loop of lockref_get improves from: > > Ack on this one regardless of the 32-bit x86 question. > > HOWEVER. > > I'd like other architectures to pipe up too, because I think right now > x86 is the only one that implements that "arch_try_cmpxchg()" family > of operations natively, and I think the generic fallback for when it > is missing might be kind of nasty. > > Maybe it ends up generating ok code, but it's also possible that it > just didn't matter when it was only used in one place in the > scheduler. > > The lockref_get() case can be quite hot under some loads, it would be > sad if this made other architectures worse. > > Anyway, maybe that try_cmpxchg() fallback is fine, and works out well > on architectures that use load-locked / store-conditional as-is. Attached to this message, please find the testcase that analyses various CMPXCHG_LOOPs. Here you will find the old, the fallback and the new cmpxchg loop, together with corresponding lockref_get_* functions.
The testcase models the x86 cmpxchg8b and can be compiled for 64bit as well as 32bit targets. As can be seen from the experiment, the try_cmpxchg fallback creates EXACTLY THE SAME code for the 64bit target as the unpatched code. For the 32bit target one extra dead reg-reg 32bit move remains in the generated fallback code assembly (this is a compiler (gcc-10.3) artefact with double-word 64bit moves on the x86_32 target).

From the above experiment, we can conclude that the patched lockref.c creates the same code with the try_cmpxchg fallback as the original code. I think the same will also be observed for other targets.

When the new code involving try_cmpxchg is compiled, impressive size gains for x86_32 can be seen: the main loop size reduces from 44 bytes to 30 bytes.

In the git repository, several transitions from cmpxchg to try_cmpxchg can be found. The above experiment confirms that the generated fallback assembly is at least as good as the original unpatched version, but can be more optimal when the target provides a try_cmpxchg instruction. Also, it looks to me that several other hot spots throughout the code can be improved by changing them from using cmpxchg to try_cmpxchg.

Uros.
[-- Attachment #2: lockref-test.c --]
[-- Type: text/x-csrc, Size: 2606 bytes --]

#include <stdint.h>

#define __aligned_u64 u64 __attribute__((aligned(8)))

#define LOCK_PREFIX "lock "

# define likely(x)	__builtin_expect(!!(x), 1)
# define unlikely(x)	__builtin_expect(!!(x), 0)

typedef uint64_t u64;
typedef uint32_t u32;

static inline void cpu_relax(void)
{
	asm volatile("rep; nop" ::: "memory");
}

static inline u64 cmpxchg64(volatile u64 *ptr, u64 old, u64 new)
{
	u64 prev;

	asm volatile(LOCK_PREFIX "cmpxchg8b %1"
		     : "=A" (prev), "+m" (*ptr)
		     : "b" ((u32)new), "c" ((u32)(new >> 32)), "0" (old)
		     : "memory");
	return prev;
}

#define CMPXCHG_LOOP_OLD(CODE, SUCCESS) do {				\
	int retry = 100;						\
	__aligned_u64 old = *lockref;					\
	while (t) {							\
		__aligned_u64 new = old, prev = old;			\
		CODE							\
		old = cmpxchg64(lockref, old, new);			\
		if (likely(old == prev)) {				\
			SUCCESS;					\
		}							\
		if (!--retry)						\
			break;						\
		cpu_relax();						\
	}								\
} while (0)

void lockref_get_old(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_OLD(
		new++;
	,
		return;
	);
}

#define try_cmpxchg64_fallback(_ptr, _oldp, _new)			\
({									\
	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r;		\
	___r = cmpxchg64((_ptr), ___o, (_new));				\
	if (unlikely(___r != ___o))					\
		*___op = ___r;						\
	likely(___r == ___o);						\
})

#define CMPXCHG_LOOP_FALLBACK(CODE, SUCCESS) do {			\
	int retry = 100;						\
	__aligned_u64 old = *lockref;					\
	while (t) {							\
		__aligned_u64 new = old;				\
		CODE							\
		if (likely(try_cmpxchg64_fallback(lockref, &old, new))) { \
			SUCCESS;					\
		}							\
		if (!--retry)						\
			break;						\
		cpu_relax();						\
	}								\
} while (0)

void lockref_get_fallback(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_FALLBACK(
		new++;
	,
		return;
	);
}

static inline _Bool try_cmpxchg64(volatile u64 *ptr, u64 *pold, u64 new)
{
	_Bool success;
	u64 old = *pold;

	asm volatile(LOCK_PREFIX "cmpxchg8b %[ptr]"
		     : "=@ccz" (success), [ptr] "+m" (*ptr), "+A" (old)
		     : "b" ((u32)new), "c" ((u32)(new >> 32))
		     : "memory");
	if (unlikely(!success))
		*pold = old;
	return success;
}

#define CMPXCHG_LOOP_NEW(CODE, SUCCESS) do {				\
	int retry = 100;						\
	__aligned_u64 old = *lockref;					\
	while (t) {							\
		__aligned_u64 new = old;				\
		CODE							\
		if (likely(try_cmpxchg64(lockref, &old, new))) {	\
			SUCCESS;					\
		}							\
		if (!--retry)						\
			break;						\
		cpu_relax();						\
	}								\
} while (0)

void lockref_get_new(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_NEW(
		new++;
	,
		return;
	);
}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-25 16:47 ` Linus Torvalds @ 2022-05-26 12:14 ` Michael Ellerman 2022-05-26 12:14 ` Michael Ellerman 2022-05-26 16:56 ` Linus Torvalds 2 siblings, 0 replies; 18+ messages in thread From: Michael Ellerman @ 2022-05-26 12:14 UTC (permalink / raw) To: Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, Heiko Carstens Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney, linuxppc-dev Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: >> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. >> x86 CMPXCHG instruction returns success in ZF flag, so this >> change saves a compare after cmpxchg (and related move instruction >> in front of cmpxchg). The main loop of lockref_get improves from: > > Ack on this one regardless of the 32-bit x86 question. > > HOWEVER. > > I'd like other architectures to pipe up too, because I think right now > x86 is the only one that implements that "arch_try_cmpxchg()" family > of operations natively, and I think the generic fallback for when it > is missing might be kind of nasty. > > Maybe it ends up generating ok code, but it's also possible that it > just didn't matter when it was only used in one place in the > scheduler. This patch seems to generate slightly *better* code on powerpc. I see one register-to-register move that gets shifted slightly later, so that it's skipped on the path that returns directly via the SUCCESS case. So LGTM. > The lockref_get() case can be quite hot under some loads, it would be > sad if this made other architectures worse. Do you know of a benchmark that shows it up? I tried a few things but couldn't get lockref_get() to count for more than 1-2%. cheers ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro @ 2022-05-26 12:14 ` Michael Ellerman 0 siblings, 0 replies; 18+ messages in thread From: Michael Ellerman @ 2022-05-26 12:14 UTC (permalink / raw) To: Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, Heiko Carstens Cc: Waiman.Long, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Thomas Gleixner, Paul McKenney, linuxppc-dev Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: >> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. >> x86 CMPXCHG instruction returns success in ZF flag, so this >> change saves a compare after cmpxchg (and related move instruction >> in front of cmpxchg). The main loop of lockref_get improves from: > > Ack on this one regardless of the 32-bit x86 question. > > HOWEVER. > > I'd like other architectures to pipe up too, because I think right now > x86 is the only one that implements that "arch_try_cmpxchg()" family > of operations natively, and I think the generic fallback for when it > is missing might be kind of nasty. > > Maybe it ends up generating ok code, but it's also possible that it > just didn't matter when it was only used in one place in the > scheduler. This patch seems to generate slightly *better* code on powerpc. I see one register-to-register move that gets shifted slightly later, so that it's skipped on the path that returns directly via the SUCCESS case. So LGTM. > The lockref_get() case can be quite hot under some loads, it would be > sad if this made other architectures worse. Do you know of a benchmark that shows it up? I tried a few things but couldn't get lockref_get() to count for more than 1-2%. cheers ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:14 ` Michael Ellerman @ 2022-05-26 12:42 ` Mark Rutland -1 siblings, 0 replies; 18+ messages in thread From: Mark Rutland @ 2022-05-26 12:42 UTC (permalink / raw) To: Michael Ellerman, Linus Torvalds Cc: Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, Heiko Carstens, the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney, linuxppc-dev On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > >> x86 CMPXCHG instruction returns success in ZF flag, so this > >> change saves a compare after cmpxchg (and related move instruction > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > Ack on this one regardless of the 32-bit x86 question. > > > > HOWEVER. > > > > I'd like other architectures to pipe up too, because I think right now > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > of operations natively, and I think the generic fallback for when it > > is missing might be kind of nasty. > > > > Maybe it ends up generating ok code, but it's also possible that it > > just didn't matter when it was only used in one place in the > > scheduler. > > This patch seems to generate slightly *better* code on powerpc. > > I see one register-to-register move that gets shifted slightly later, so > that it's skipped on the path that returns directly via the SUCCESS > case. FWIW, I see the same on arm64; a register-to-register move gets moved out of the success path. That changes the register allocation, resulting in one fewer move, but otherwise the code generation is the same. Thanks, Mark. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro @ 2022-05-26 12:42 ` Mark Rutland 0 siblings, 0 replies; 18+ messages in thread From: Mark Rutland @ 2022-05-26 12:42 UTC (permalink / raw) To: Michael Ellerman, Linus Torvalds Cc: Waiman.Long, Thomas Bogendoerfer, Will Deacon, Peter Zijlstra, Catalin Marinas, Heiko Carstens, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, Paul McKenney On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > >> x86 CMPXCHG instruction returns success in ZF flag, so this > >> change saves a compare after cmpxchg (and related move instruction > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > Ack on this one regardless of the 32-bit x86 question. > > > > HOWEVER. > > > > I'd like other architectures to pipe up too, because I think right now > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > of operations natively, and I think the generic fallback for when it > > is missing might be kind of nasty. > > > > Maybe it ends up generating ok code, but it's also possible that it > > just didn't matter when it was only used in one place in the > > scheduler. > > This patch seems to generate slightly *better* code on powerpc. > > I see one register-to-register move that gets shifted slightly later, so > that it's skipped on the path that returns directly via the SUCCESS > case. FWIW, I see the same on arm64; a register-to-register move gets moved out of the success path. That changes the register allocation, resulting in one fewer move, but otherwise the code generation is the same. Thanks, Mark. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:42 ` Mark Rutland @ 2022-05-27 9:36 ` Heiko Carstens -1 siblings, 0 replies; 18+ messages in thread From: Heiko Carstens @ 2022-05-27 9:36 UTC (permalink / raw) To: Mark Rutland Cc: Michael Ellerman, Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney, linuxppc-dev On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote: > On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > >> > > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > > >> x86 CMPXCHG instruction returns success in ZF flag, so this > > >> change saves a compare after cmpxchg (and related move instruction > > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > > > Ack on this one regardless of the 32-bit x86 question. > > > > > > HOWEVER. > > > > > > I'd like other architectures to pipe up too, because I think right now > > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > > of operations natively, and I think the generic fallback for when it > > > is missing might be kind of nasty. > > > > > > Maybe it ends up generating ok code, but it's also possible that it > > > just didn't matter when it was only used in one place in the > > > scheduler. > > > > This patch seems to generate slightly *better* code on powerpc. > > > > I see one register-to-register move that gets shifted slightly later, so > > that it's skipped on the path that returns directly via the SUCCESS > > case. > > FWIW, I see the same on arm64; a register-to-register move gets moved out of > the success path. 
> That changes the register allocation, and resulting in one
> fewer move, but otherwise the code generation is the same.

Just for the record: s390 code generation changes in the same way as on powerpc; so looks good.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro @ 2022-05-27 9:36 ` Heiko Carstens 0 siblings, 0 replies; 18+ messages in thread From: Heiko Carstens @ 2022-05-27 9:36 UTC (permalink / raw) To: Mark Rutland Cc: Waiman.Long, Thomas Bogendoerfer, Will Deacon, Peter Zijlstra, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Catalin Marinas, Thomas Gleixner, Paul McKenney, Linus Torvalds On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote: > On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > >> > > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > > >> x86 CMPXCHG instruction returns success in ZF flag, so this > > >> change saves a compare after cmpxchg (and related move instruction > > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > > > Ack on this one regardless of the 32-bit x86 question. > > > > > > HOWEVER. > > > > > > I'd like other architectures to pipe up too, because I think right now > > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > > of operations natively, and I think the generic fallback for when it > > > is missing might be kind of nasty. > > > > > > Maybe it ends up generating ok code, but it's also possible that it > > > just didn't matter when it was only used in one place in the > > > scheduler. > > > > This patch seems to generate slightly *better* code on powerpc. > > > > I see one register-to-register move that gets shifted slightly later, so > > that it's skipped on the path that returns directly via the SUCCESS > > case. > > FWIW, I see the same on arm64; a register-to-register move gets moved out of > the success path. That changes the register allocation, and resulting in one > fewer move, but otherwise the code generation is the same. 
Just for the record: s390 code generation changes in the same way as on powerpc; so looks good.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:14 ` Michael Ellerman @ 2022-05-26 16:52 ` Linus Torvalds -1 siblings, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2022-05-26 16:52 UTC (permalink / raw) To: Michael Ellerman Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas, Heiko Carstens, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, Paul McKenney, Will Deacon On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <mpe@ellerman.id.au> wrote: > > Do you know of a benchmark that shows it up? I tried a few things but > couldn't get lockref_get() to count for more than 1-2%. Heh. 1% for a small instruction sequence that is only a handful of instructions and is used in just a couple of places counts as "very hot" for me. I assume you did the natural thing: threaded pathname lookup (with paths being of the longer variety to not be dominated by system call etc costs). Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro @ 2022-05-26 16:52 ` Linus Torvalds 0 siblings, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2022-05-26 16:52 UTC (permalink / raw) To: Michael Ellerman Cc: Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, Heiko Carstens, the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney, linuxppc-dev On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <mpe@ellerman.id.au> wrote: > > Do you know of a benchmark that shows it up? I tried a few things but > couldn't get lockref_get() to count for more than 1-2%. Heh. 1% for a small instruction sequence that is only a handful of instructions and is used in just a couple of places counts as "very hot" for me. I assume you did the natural thing: threaded pathname lookup (with paths being of the longer variety to not be dominated by system call etc costs). Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-25 16:47 ` Linus Torvalds 2022-05-26 8:54 ` Uros Bizjak 2022-05-26 12:14 ` Michael Ellerman @ 2022-05-26 16:56 ` Linus Torvalds 2 siblings, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2022-05-26 16:56 UTC (permalink / raw) To: Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Michael Ellerman, Thomas Bogendoerfer, Heiko Carstens Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney On Wed, May 25, 2022 at 9:47 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ack on this one regardless of the 32-bit x86 question. .. and with confirmation from Michael and Mark (and the analysis from Uros on the fallback code), I've applied this to my tree. NOTE! I have *not* applied the second patch with the x86-32 change to enable this on 32-bit with CMPXCHG8B enabled. I think that's independent, and will leave it up to the x86 maintainers. Plus I think the patch should be simplified. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64
  2022-05-25 14:40 [PATCH 0/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Uros Bizjak
  2022-05-25 14:40 ` [PATCH 1/2] " Uros Bizjak
@ 2022-05-25 14:40 ` Uros Bizjak
  2022-05-25 16:29   ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2022-05-25 14:40 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Waiman.Long, paulmck

Commit bc08b449ee14ace4d869adaa1bb35a44ce68d775 enabled lockless
reference count updates using cmpxchg() only for x86_64 and left
x86_32 behind due to inability to detect support for cmpxchg8b
instruction. Nowadays, we can use CONFIG_X86_CMPXCHG64 for this
purpose. Also, by using try_cmpxchg64() instead of cmpxchg() in
CMPXCHG_LOOP macro, the compiler actually produces sane code,
improving lockref_get_or_lock main loop from:

   2a5:	8d 48 01             	lea    0x1(%eax),%ecx
   2a8:	85 c0                	test   %eax,%eax
   2aa:	7e 3c                	jle    2e8 <lockref_get_or_lock+0x78>
   2ac:	8b 44 24 08          	mov    0x8(%esp),%eax
   2b0:	8b 54 24 0c          	mov    0xc(%esp),%edx
   2b4:	8b 74 24 04          	mov    0x4(%esp),%esi
   2b8:	f0 0f c7 0e          	lock cmpxchg8b (%esi)
   2bc:	8b 4c 24 0c          	mov    0xc(%esp),%ecx
   2c0:	89 c3                	mov    %eax,%ebx
   2c2:	89 d0                	mov    %edx,%eax
   2c4:	8b 74 24 08          	mov    0x8(%esp),%esi
   2c8:	31 ca                	xor    %ecx,%edx
   2ca:	31 de                	xor    %ebx,%esi
   2cc:	09 f2                	or     %esi,%edx
   2ce:	75 40                	jne    310 <lockref_get_or_lock+0xa0>

to:

    2d:	8d 4f 01             	lea    0x1(%edi),%ecx
    30:	85 ff                	test   %edi,%edi
    32:	7e 1c                	jle    50 <lockref_get_or_lock+0x50>
    34:	f0 0f c7 0e          	lock cmpxchg8b (%esi)
    38:	75 36                	jne    70 <lockref_get_or_lock+0x70>

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
---
 arch/x86/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 762a0b6ab8b6..326cfdc4f136 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,7 +27,6 @@ config X86_64
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
-	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
@@ -111,6 +110,7 @@ config X86
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select ARCH_USE_BUILTIN_BSWAP
+	select ARCH_USE_CMPXCHG_LOCKREF if X86_64 || (X86_32 && X86_CMPXCHG64)
 	select ARCH_USE_MEMTEST
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 2022-05-25 14:40 ` [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 Uros Bizjak @ 2022-05-25 16:29 ` Linus Torvalds 2022-05-26 8:30 ` David Laight 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2022-05-25 16:29 UTC (permalink / raw) To: Uros Bizjak Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > + select ARCH_USE_CMPXCHG_LOCKREF if X86_64 || (X86_32 && X86_CMPXCHG64) Ugh. That looks pointlessly complicated. X86_64 already enables X86_CMPXCHG64 afaik, so you can just say select ARCH_USE_CMPXCHG_LOCKREF if X86_CMPXCHG64 which is much clearer: CMPXCHG_LOCKREF needs CMPXCHG64, and the Kconfig option says exactly that. That said, this also makes me wonder if we should just make cmpxchg8b requirement unconditional. Googling around, it looks like Windows NT stopped booting on CPUs without cmpxchg8b in version 5.1. That was in 2001. Here we are, 21 years later, and we still ostensibly try to support CPUs without it, but I doubt it's worth it. So maybe time to just say "we require cmpxchg8b". In fact, I think we have effectively done that already for years, since we have config X86_CMPXCHG64 def_bool y depends on X86_PAE || ... iow, enabling X86_PAE will force-enable CMPXCHG8B due to the wider page table entries. And I would assume that all distros basically do that anyway (but I do not have a 32-bit distro around to check). It would mean that we would finally drop support for i486 (and possibly some Pentium clones, but afaik a number of them did actually support cmpxchg8b even if they didn't report it in cpuid). Comments? Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 2022-05-25 16:29 ` Linus Torvalds @ 2022-05-26 8:30 ` David Laight 2022-05-26 9:12 ` Uros Bizjak 2022-05-26 17:23 ` Linus Torvalds 0 siblings, 2 replies; 18+ messages in thread From: David Laight @ 2022-05-26 8:30 UTC (permalink / raw) To: 'Linus Torvalds', Uros Bizjak Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Waiman.Long, Paul McKenney From: Linus Torvalds > Sent: 25 May 2022 17:30 > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > + select ARCH_USE_CMPXCHG_LOCKREF if X86_64 || (X86_32 && X86_CMPXCHG64) > > Ugh. That looks pointlessly complicated. X86_64 already enables > X86_CMPXCHG64 afaik, so you can just say > > select ARCH_USE_CMPXCHG_LOCKREF if X86_CMPXCHG64 > > which is much clearer: CMPXCHG_LOCKREF needs CMPXCHG64, and the > Kconfig option says exactly that. > > That said, this also makes me wonder if we should just make cmpxchg8b > requirement unconditional. > > Googling around, it looks like Windows NT stopped booting on CPUs > without cmpxchg8b in version 5.1. That was in 2001. > > Here we are, 21 years later, and we still ostensibly try to support > CPUs without it, but I doubt it's worth it. > > So maybe time to just say "we require cmpxchg8b". > > In fact, I think we have effectively done that already for years, since we have > > config X86_CMPXCHG64 > def_bool y > depends on X86_PAE || ... > > iow, enabling X86_PAE will force-enable CMPXCHG8B due to the wider > page table entries. > > And I would assume that all distros basically do that anyway (but I do > not have a 32-bit distro around to check). > > It would mean that we would finally drop support for i486 (and > possibly some Pentium clones, but afaik a number of them did actually > support cmpxchg8b even if they didn't report it in cpuid). 
Perhaps there could be a non-smp implementation of cmpxchg8b
that just disables interrupts?

While I have used a dual 486 I doubt Linux would ever have
run on it. The same is probably true for old dual Pentiums.

I think there are still some 486-class embedded cpus that include
a few of the newer instructions (usually things like rdtsc).
But they won't be smp.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64
  2022-05-26  8:30     ` David Laight
@ 2022-05-26  9:12       ` Uros Bizjak
  0 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2022-05-26  9:12 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner,
	Waiman.Long, Paul McKenney

On Thu, May 26, 2022 at 10:30 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Linus Torvalds
> > Sent: 25 May 2022 17:30
> >
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > +	select ARCH_USE_CMPXCHG_LOCKREF	if X86_64 || (X86_32 && X86_CMPXCHG64)
> >
> > Ugh. That looks pointlessly complicated. X86_64 already enables
> > X86_CMPXCHG64 afaik, so you can just say
> >
> > 	select ARCH_USE_CMPXCHG_LOCKREF if X86_CMPXCHG64
> >
> > which is much clearer: CMPXCHG_LOCKREF needs CMPXCHG64, and the
> > Kconfig option says exactly that.
> >
> > That said, this also makes me wonder if we should just make cmpxchg8b
> > requirement unconditional.
> >
> > Googling around, it looks like Windows NT stopped booting on CPUs
> > without cmpxchg8b in version 5.1. That was in 2001.
> >
> > Here we are, 21 years later, and we still ostensibly try to support
> > CPUs without it, but I doubt it's worth it.
> >
> > So maybe time to just say "we require cmpxchg8b".
> >
> > In fact, I think we have effectively done that already for years, since we have
> >
> > config X86_CMPXCHG64
> > 	def_bool y
> > 	depends on X86_PAE || ...
> >
> > iow, enabling X86_PAE will force-enable CMPXCHG8B due to the wider
> > page table entries.
> >
> > And I would assume that all distros basically do that anyway (but I do
> > not have a 32-bit distro around to check).
> >
> > It would mean that we would finally drop support for i486 (and
> > possibly some Pentium clones, but afaik a number of them did actually
> > support cmpxchg8b even if they didn't report it in cpuid).
>
> Perhaps there could be a non-smp implementation of cmpxchg8b
> that just disables interrupts?
>
> While I have used a dual 486 I doubt Linux would ever have
> run on it. The same is probably true for old dual Pentiums.
>
> I think there are still some 486-class embedded cpus that include
> a few of the newer instructions (usually things like rdtsc).
> But they won't be smp.

The above solution is already implemented when X86_CMPXCHG64 is not
defined. Please see arch/x86/include/asm/cmpxchg_32.h, where in this
case we have:

#define arch_cmpxchg64(ptr, o, n)				\
({								\
	__typeof__(*(ptr)) __ret;				\
	__typeof__(*(ptr)) __old = (o);				\
	__typeof__(*(ptr)) __new = (n);				\
	alternative_io(LOCK_PREFIX_HERE				\
			"call cmpxchg8b_emu",			\
			"lock; cmpxchg8b (%%esi)",		\
		       X86_FEATURE_CX8,				\
		       "=A" (__ret),				\
		       "S" ((ptr)), "0" (__old),		\
		       "b" ((unsigned int)__new),		\
		       "c" ((unsigned int)(__new>>32))		\
		       : "memory");				\
	__ret; })

Without CX8, the code will be patched to a call to cmpxchg8b_emu,
defined in arch/x86/lib/cmpxchg8b_emu.S.

I think that the solution with CONFIG_X86_CMPXCHG64 is quite good.
When the flag is defined, arch_cmpxchg64 emits the real CMPXCHG8B
instruction. Without it, the above definition is compiled in, and the
code is patched at runtime to either the call or the real instruction.

Also, the config flag makes a difference only in
arch/x86/include/asm/cmpxchg_32.h, and with the try_cmpxchg64
fallbacks, one can still use cmpxchg64/try_cmpxchg64 even when the HW
doesn't support CMPXCHG8B natively.

Uros.
* Re: [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64
  2022-05-26  8:30     ` David Laight
  2022-05-26  9:12       ` Uros Bizjak
@ 2022-05-26 17:23       ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2022-05-26 17:23 UTC (permalink / raw)
  To: David Laight
  Cc: Uros Bizjak, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner,
	Waiman.Long, Paul McKenney

On Thu, May 26, 2022 at 1:30 AM David Laight <David.Laight@aculab.com> wrote:
>
> Perhaps there could be a non-smp implementation of cmpxchg8b
> that just disables interrupts?

As Uros points out, we do have exactly that, but it's not actually
written to be usable for the "trylock" case.

Plus it would be pointless for lockrefs, since the non-cmpxchg case
that just falls back to a spinlock would be faster and simpler
(particularly on UP, where locking goes away).

> While I have used a dual 486 I doubt Linux would ever have
> run on it. The same is probably true for old dual Pentiums.

Yeah, I don't think we ever supported SMP on i486, afaik they all
needed special system glue.

I think the "modern" x86 SMP support with a local APIC was a PPro and
newer thing historically, but clearly there were then later what
amounted to Pentium/MMX class cores (ie old Atom) that did support
SMP.

             Linus
end of thread, other threads:[~2022-05-27  9:37 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-25 14:40 [PATCH 0/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Uros Bizjak
2022-05-25 14:40 ` [PATCH 1/2] " Uros Bizjak
2022-05-25 16:47   ` Linus Torvalds
2022-05-26  8:54     ` Uros Bizjak
2022-05-26 12:14       ` Michael Ellerman
2022-05-26 12:42         ` Mark Rutland
2022-05-27  9:36           ` Heiko Carstens
2022-05-26 16:52         ` Linus Torvalds
2022-05-26 16:56           ` Linus Torvalds
2022-05-25 14:40 ` [PATCH 2/2] locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_32 && X86_CMPXCHG64 Uros Bizjak
2022-05-25 16:29   ` Linus Torvalds
2022-05-26  8:30     ` David Laight
2022-05-26  9:12       ` Uros Bizjak
2022-05-26 17:23       ` Linus Torvalds