* [PATCH V4 0/5] riscv: Optimize atomic implementation
@ 2022-05-05  3:55 guoren
  2022-05-05  3:55 ` [PATCH V4 1/5] riscv: atomic: Cleanup unnecessary definition guoren
  ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

Here are some optimizations for the riscv atomic implementation. The
first three patches are ordinary cleanups and custom implementations
that do not change atomic semantics. The 4th is similar to the arm64
LSE approach, using the embedded .aq/.rl annotations. The 5th improves
the riscv implementation by removing a full-barrier cost.

Changes in V4:
 - Fix coding conventions & improve the comments
 - Re-order the patchset

Changes in V3:
 - Fix up the usage of lr.rl & sc.aq, which violates the ISA
 - Add optimized dec_if_positive functions
 - Add conditional atomic operations' optimization

Changes in V2:
 - Fix up LR/SC memory barrier semantic problems pointed out by Mark
   Rutland
 - Combine patches into one patchset series
 - Separate the AMO optimization & LR/SC optimization to ease patch
   review

Guo Ren (5):
  riscv: atomic: Cleanup unnecessary definition
  riscv: atomic: Optimize dec_if_positive functions
  riscv: atomic: Add custom conditional atomic operation implementation
  riscv: atomic: Optimize atomic_ops & xchg with .aq/rl annotation
  riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation

 arch/riscv/include/asm/atomic.h  | 174 +++++++++++++++++++++++++++----
 arch/riscv/include/asm/cmpxchg.h |  30 ++----
 2 files changed, 162 insertions(+), 42 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 24+ messages in thread
* [PATCH V4 1/5] riscv: atomic: Cleanup unnecessary definition
  2022-05-05  3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren
@ 2022-05-05  3:55 ` guoren
  2022-05-05  3:55 ` [PATCH V4 2/5] riscv: atomic: Optimize dec_if_positive functions guoren
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

The cmpxchg32 & cmpxchg32_local macros are no longer used anywhere in
Linux, so remove them from asm/cmpxchg.h.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Rutland <mark.rutland@arm.com>
---
 arch/riscv/include/asm/cmpxchg.h | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h
index 36dc962f6343..12debce235e5 100644
--- a/arch/riscv/include/asm/cmpxchg.h
+++ b/arch/riscv/include/asm/cmpxchg.h
@@ -348,18 +348,6 @@
 #define arch_cmpxchg_local(ptr, o, n)				\
	(__cmpxchg_relaxed((ptr), (o), (n), sizeof(*(ptr))))
 
-#define cmpxchg32(ptr, o, n)					\
-({								\
-	BUILD_BUG_ON(sizeof(*(ptr)) != 4);			\
-	arch_cmpxchg((ptr), (o), (n));				\
-})
-
-#define cmpxchg32_local(ptr, o, n)				\
-({								\
-	BUILD_BUG_ON(sizeof(*(ptr)) != 4);			\
-	arch_cmpxchg_relaxed((ptr), (o), (n))			\
-})
-
 #define arch_cmpxchg64(ptr, o, n)				\
 ({								\
	BUILD_BUG_ON(sizeof(*(ptr)) != 8);			\
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [PATCH V4 2/5] riscv: atomic: Optimize dec_if_positive functions
  2022-05-05  3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren
  2022-05-05  3:55 ` [PATCH V4 1/5] riscv: atomic: Cleanup unnecessary definition guoren
@ 2022-05-05  3:55 ` guoren
  2022-05-05  3:55 ` [PATCH V4 3/5] riscv: atomic: Add custom conditional atomic operation implementation guoren
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

The current implementation wastes a register to pass the offset
argument, while a single addi is enough to compute the result.
Optimize the code to minimize register usage.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dan Lustig <dlustig@nvidia.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
---
 arch/riscv/include/asm/atomic.h | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index ac9bdf4fc404..f3c6a6eac02a 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -310,47 +310,47 @@ ATOMIC_OPS()
 #undef ATOMIC_OPS
 #undef ATOMIC_OP
 
-static __always_inline int arch_atomic_sub_if_positive(atomic_t *v, int offset)
+static __always_inline int arch_atomic_dec_if_positive(atomic_t *v)
 {
	int prev, rc;
 
	__asm__ __volatile__ (
		"0:	lr.w	%[p],  %[c]\n"
-		"	sub	%[rc], %[p], %[o]\n"
+		"	addi	%[rc], %[p], -1\n"
		"	bltz	%[rc], 1f\n"
		"	sc.w.rl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
-		: [o]"r" (offset)
+		:
		: "memory");
-	return prev - offset;
+	return prev - 1;
 }
 
-#define arch_atomic_dec_if_positive(v)	arch_atomic_sub_if_positive(v, 1)
+#define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
 
 #ifndef CONFIG_GENERIC_ATOMIC64
-static __always_inline s64 arch_atomic64_sub_if_positive(atomic64_t *v, s64 offset)
+static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 {
	s64 prev;
	long rc;
 
	__asm__ __volatile__ (
		"0:	lr.d	%[p],  %[c]\n"
-		"	sub	%[rc], %[p], %[o]\n"
+		"	addi	%[rc], %[p], -1\n"
		"	bltz	%[rc], 1f\n"
		"	sc.d.rl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
-		: [o]"r" (offset)
+		:
		: "memory");
-	return prev - offset;
+	return prev - 1;
 }
 
-#define arch_atomic64_dec_if_positive(v)	arch_atomic64_sub_if_positive(v, 1)
+#define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
 #endif
 
 #endif /* _ASM_RISCV_ATOMIC_H */
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [PATCH V4 3/5] riscv: atomic: Add custom conditional atomic operation implementation
  2022-05-05  3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren
  2022-05-05  3:55 ` [PATCH V4 1/5] riscv: atomic: Cleanup unnecessary definition guoren
  2022-05-05  3:55 ` [PATCH V4 2/5] riscv: atomic: Optimize dec_if_positive functions guoren
@ 2022-05-05  3:55 ` guoren
  2022-05-05  3:55 ` [PATCH V4 4/5] riscv: atomic: Optimize atomic_ops & xchg with .aq/rl annotation guoren
  2022-05-05  3:55 ` [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation guoren
  4 siblings, 0 replies; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

Add a custom implementation for the conditional atomic operations
(similar to dec_if_positive). Here is the list:
 - arch_atomic_inc_unless_negative
 - arch_atomic_dec_unless_positive
 - arch_atomic64_inc_unless_negative
 - arch_atomic64_dec_unless_positive

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Dan Lustig <dlustig@nvidia.com>
---
 arch/riscv/include/asm/atomic.h | 82 +++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index f3c6a6eac02a..0dfe9d857a76 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -310,6 +310,46 @@ ATOMIC_OPS()
 #undef ATOMIC_OPS
 #undef ATOMIC_OP
 
+static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v)
+{
+	int prev, rc;
+
+	__asm__ __volatile__ (
+		"0:	lr.w	%[p],  %[c]\n"
+		"	bltz	%[p],  1f\n"
+		"	addi	%[rc], %[p], 1\n"
+		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	bnez	%[rc], 0b\n"
+		"	fence	rw, rw\n"
+		"1:\n"
+		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
+		:
+		: "memory");
+	return !(prev < 0);
+}
+
+#define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
+
+static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v)
+{
+	int prev, rc;
+
+	__asm__ __volatile__ (
+		"0:	lr.w	%[p],  %[c]\n"
+		"	bgtz	%[p],  1f\n"
+		"	addi	%[rc], %[p], -1\n"
+		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	bnez	%[rc], 0b\n"
+		"	fence	rw, rw\n"
+		"1:\n"
+		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
+		:
+		: "memory");
+	return !(prev > 0);
+}
+
+#define arch_atomic_dec_unless_positive arch_atomic_dec_unless_positive
+
 static __always_inline int arch_atomic_dec_if_positive(atomic_t *v)
 {
	int prev, rc;
@@ -331,6 +371,48 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v)
 #define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
 
 #ifndef CONFIG_GENERIC_ATOMIC64
+static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v)
+{
+	s64 prev;
+	long rc;
+
+	__asm__ __volatile__ (
+		"0:	lr.d	%[p],  %[c]\n"
+		"	bltz	%[p],  1f\n"
+		"	addi	%[rc], %[p], 1\n"
+		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	bnez	%[rc], 0b\n"
+		"	fence	rw, rw\n"
+		"1:\n"
+		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
+		:
+		: "memory");
+	return !(prev < 0);
+}
+
+#define arch_atomic64_inc_unless_negative arch_atomic64_inc_unless_negative
+
+static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v)
+{
+	s64 prev;
+	long rc;
+
+	__asm__ __volatile__ (
+		"0:	lr.d	%[p],  %[c]\n"
+		"	bgtz	%[p],  1f\n"
+		"	addi	%[rc], %[p], -1\n"
+		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	bnez	%[rc], 0b\n"
+		"	fence	rw, rw\n"
+		"1:\n"
+		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
+		:
+		: "memory");
+	return !(prev > 0);
+}
+
+#define arch_atomic64_dec_unless_positive arch_atomic64_dec_unless_positive
+
 static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 {
	s64 prev;
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [PATCH V4 4/5] riscv: atomic: Optimize atomic_ops & xchg with .aq/rl annotation
  2022-05-05  3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren
  ` (2 preceding siblings ...)
  2022-05-05  3:55 ` [PATCH V4 3/5] riscv: atomic: Add custom conditional atomic operation implementation guoren
@ 2022-05-05  3:55 ` guoren
  2022-05-05  3:55 ` [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation guoren
  4 siblings, 0 replies; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

The current acquire & release implementations of atomic_ops use fence
instructions, which add the extra cost of "fence r,rw" / "fence rw,w"
barriers. They are too heavy for the Linux atomic acquire & release
semantic requirements. RISC-V AMO instructions can embed acquire and
release semantics directly. Here is the related description from the
RISC-V ISA spec:

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics.
 - .aq:   If the aq bit is set, then no later memory operations in
          this RISC-V hart can be observed to take place before the
          AMO.
 - .rl:   If the rl bit is set, then other RISC-V harts will not
          observe the AMO before memory accesses preceding the AMO in
          this RISC-V hart.
 - .aqrl: Setting both the aq and the rl bit on an AMO makes the
          sequence sequentially consistent, meaning that it cannot be
          reordered with earlier or later memory operations from the
          same hart.

Let's use the above to optimize riscv atomic performance, just like
arm64/include/asm/atomic_lse.h.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Dan Lustig <dlustig@nvidia.com>
---
 arch/riscv/include/asm/atomic.h  | 64 ++++++++++++++++++++++++++++++++
 arch/riscv/include/asm/cmpxchg.h | 12 ++----
 2 files changed, 68 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index 0dfe9d857a76..34f757dfc8f2 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -99,6 +99,30 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,	\
	return ret;							\
 }									\
 static __always_inline							\
+c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i,		\
+					      atomic##prefix##_t *v)	\
+{									\
+	register c_type ret;						\
+	__asm__ __volatile__ (						\
+		"	amo" #asm_op "." #asm_type ".aq %1, %2, %0"	\
+		: "+A" (v->counter), "=r" (ret)				\
+		: "r" (I)						\
+		: "memory");						\
+	return ret;							\
+}									\
+static __always_inline							\
+c_type arch_atomic##prefix##_fetch_##op##_release(c_type i,		\
+					      atomic##prefix##_t *v)	\
+{									\
+	register c_type ret;						\
+	__asm__ __volatile__ (						\
+		"	amo" #asm_op "." #asm_type ".rl %1, %2, %0"	\
+		: "+A" (v->counter), "=r" (ret)				\
+		: "r" (I)						\
+		: "memory");						\
+	return ret;							\
+}									\
+static __always_inline							\
 c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)	\
 {									\
	register c_type ret;						\
@@ -118,6 +142,18 @@ c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i,	\
	return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I;	\
 }									\
 static __always_inline							\
+c_type arch_atomic##prefix##_##op##_return_acquire(c_type i,		\
+					      atomic##prefix##_t *v)	\
+{									\
+	return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I;	\
+}									\
+static __always_inline							\
+c_type arch_atomic##prefix##_##op##_return_release(c_type i,		\
+					      atomic##prefix##_t *v)	\
+{									\
+	return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I;	\
+}									\
+static __always_inline							\
 c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v)	\
 {									\
	return arch_atomic##prefix##_fetch_##op(i, v) c_op I;		\
@@ -140,22 +176,38 @@ ATOMIC_OPS(sub, add, +, -i)
 #define arch_atomic_add_return_relaxed	arch_atomic_add_return_relaxed
 #define arch_atomic_sub_return_relaxed	arch_atomic_sub_return_relaxed
+#define arch_atomic_add_return_acquire	arch_atomic_add_return_acquire
+#define arch_atomic_sub_return_acquire	arch_atomic_sub_return_acquire
+#define arch_atomic_add_return_release	arch_atomic_add_return_release
+#define arch_atomic_sub_return_release	arch_atomic_sub_return_release
 #define arch_atomic_add_return		arch_atomic_add_return
 #define arch_atomic_sub_return		arch_atomic_sub_return
 
 #define arch_atomic_fetch_add_relaxed	arch_atomic_fetch_add_relaxed
 #define arch_atomic_fetch_sub_relaxed	arch_atomic_fetch_sub_relaxed
+#define arch_atomic_fetch_add_acquire	arch_atomic_fetch_add_acquire
+#define arch_atomic_fetch_sub_acquire	arch_atomic_fetch_sub_acquire
+#define arch_atomic_fetch_add_release	arch_atomic_fetch_add_release
+#define arch_atomic_fetch_sub_release	arch_atomic_fetch_sub_release
 #define arch_atomic_fetch_add		arch_atomic_fetch_add
 #define arch_atomic_fetch_sub		arch_atomic_fetch_sub
 
 #ifndef CONFIG_GENERIC_ATOMIC64
 #define arch_atomic64_add_return_relaxed	arch_atomic64_add_return_relaxed
 #define arch_atomic64_sub_return_relaxed	arch_atomic64_sub_return_relaxed
+#define arch_atomic64_add_return_acquire	arch_atomic64_add_return_acquire
+#define arch_atomic64_sub_return_acquire	arch_atomic64_sub_return_acquire
+#define arch_atomic64_add_return_release	arch_atomic64_add_return_release
+#define arch_atomic64_sub_return_release	arch_atomic64_sub_return_release
 #define arch_atomic64_add_return		arch_atomic64_add_return
 #define arch_atomic64_sub_return		arch_atomic64_sub_return
 
 #define arch_atomic64_fetch_add_relaxed	arch_atomic64_fetch_add_relaxed
 #define arch_atomic64_fetch_sub_relaxed	arch_atomic64_fetch_sub_relaxed
+#define arch_atomic64_fetch_add_acquire	arch_atomic64_fetch_add_acquire
+#define arch_atomic64_fetch_sub_acquire	arch_atomic64_fetch_sub_acquire
+#define arch_atomic64_fetch_add_release	arch_atomic64_fetch_add_release
+#define arch_atomic64_fetch_sub_release	arch_atomic64_fetch_sub_release
 #define arch_atomic64_fetch_add		arch_atomic64_fetch_add
 #define arch_atomic64_fetch_sub		arch_atomic64_fetch_sub
 #endif
@@ -178,6 +230,12 @@ ATOMIC_OPS(xor, xor, i)
 #define arch_atomic_fetch_and_relaxed	arch_atomic_fetch_and_relaxed
 #define arch_atomic_fetch_or_relaxed	arch_atomic_fetch_or_relaxed
 #define arch_atomic_fetch_xor_relaxed	arch_atomic_fetch_xor_relaxed
+#define arch_atomic_fetch_and_acquire	arch_atomic_fetch_and_acquire
+#define arch_atomic_fetch_or_acquire	arch_atomic_fetch_or_acquire
+#define arch_atomic_fetch_xor_acquire	arch_atomic_fetch_xor_acquire
+#define arch_atomic_fetch_and_release	arch_atomic_fetch_and_release
+#define arch_atomic_fetch_or_release	arch_atomic_fetch_or_release
+#define arch_atomic_fetch_xor_release	arch_atomic_fetch_xor_release
 #define arch_atomic_fetch_and		arch_atomic_fetch_and
 #define arch_atomic_fetch_or		arch_atomic_fetch_or
 #define arch_atomic_fetch_xor		arch_atomic_fetch_xor
@@ -186,6 +244,12 @@ ATOMIC_OPS(xor, xor, i)
 #define arch_atomic64_fetch_and_relaxed	arch_atomic64_fetch_and_relaxed
 #define arch_atomic64_fetch_or_relaxed	arch_atomic64_fetch_or_relaxed
 #define arch_atomic64_fetch_xor_relaxed	arch_atomic64_fetch_xor_relaxed
+#define arch_atomic64_fetch_and_acquire	arch_atomic64_fetch_and_acquire
+#define arch_atomic64_fetch_or_acquire	arch_atomic64_fetch_or_acquire
+#define arch_atomic64_fetch_xor_acquire	arch_atomic64_fetch_xor_acquire
+#define arch_atomic64_fetch_and_release	arch_atomic64_fetch_and_release
+#define arch_atomic64_fetch_or_release	arch_atomic64_fetch_or_release
+#define arch_atomic64_fetch_xor_release	arch_atomic64_fetch_xor_release
 #define arch_atomic64_fetch_and		arch_atomic64_fetch_and
 #define arch_atomic64_fetch_or		arch_atomic64_fetch_or
 #define arch_atomic64_fetch_xor		arch_atomic64_fetch_xor
diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h
index 12debce235e5..1af8db92250b 100644
--- a/arch/riscv/include/asm/cmpxchg.h
+++ b/arch/riscv/include/asm/cmpxchg.h
@@ -52,16 +52,14 @@
	switch (size) {							\
	case 4:								\
		__asm__ __volatile__ (					\
-			"	amoswap.w %0, %2, %1\n"			\
-			RISCV_ACQUIRE_BARRIER				\
+			"	amoswap.w.aq %0, %2, %1\n"		\
			: "=r" (__ret), "+A" (*__ptr)			\
			: "r" (__new)					\
			: "memory");					\
		break;							\
	case 8:								\
		__asm__ __volatile__ (					\
-			"	amoswap.d %0, %2, %1\n"			\
-			RISCV_ACQUIRE_BARRIER				\
+			"	amoswap.d.aq %0, %2, %1\n"		\
			: "=r" (__ret), "+A" (*__ptr)			\
			: "r" (__new)					\
			: "memory");					\
		break;							\
@@ -87,16 +85,14 @@
	switch (size) {							\
	case 4:								\
		__asm__ __volatile__ (					\
-			RISCV_RELEASE_BARRIER				\
-			"	amoswap.w %0, %2, %1\n"			\
+			"	amoswap.w.rl %0, %2, %1\n"		\
			: "=r" (__ret), "+A" (*__ptr)			\
			: "r" (__new)					\
			: "memory");					\
		break;							\
	case 8:								\
		__asm__ __volatile__ (					\
-			RISCV_RELEASE_BARRIER				\
-			"	amoswap.d %0, %2, %1\n"			\
+			"	amoswap.d.rl %0, %2, %1\n"		\
			: "=r" (__ret), "+A" (*__ptr)			\
			: "r" (__new)					\
			: "memory");					\
		break;							\
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation
  2022-05-05  3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren
  ` (3 preceding siblings ...)
  2022-05-05  3:55 ` [PATCH V4 4/5] riscv: atomic: Optimize atomic_ops & xchg with .aq/rl annotation guoren
@ 2022-05-05  3:55 ` guoren
  2022-05-21 20:46   ` Palmer Dabbelt
  4 siblings, 1 reply; 24+ messages in thread
From: guoren @ 2022-05-05  3:55 UTC (permalink / raw)
  To: guoren, arnd, palmer, mark.rutland, will, peterz, boqun.feng,
	dlustig, parri.andrea
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

The current implementation is the same as 8e86f0b409a4 ("arm64:
atomics: fix use of acquire + release for full barrier semantics").
RISC-V can combine acquire and release into the SC instruction, which
saves a fence instruction and gains better performance. Here is the
related description from RISC-V ISA 10.2 Load-Reserved/
Store-Conditional Instructions:

 - .aq:   The LR/SC sequence can be given acquire semantics by
          setting the aq bit on the LR instruction.
 - .rl:   The LR/SC sequence can be given release semantics by
          setting the rl bit on the SC instruction.
 - .aqrl: Setting the aq bit on the LR instruction, and setting
          both the aq and the rl bit on the SC instruction makes
          the LR/SC sequence sequentially consistent, meaning that
          it cannot be reordered with earlier or later memory
          operations from the same hart.

   Software should not set the rl bit on an LR instruction unless
   the aq bit is also set, nor should software set the aq bit on an
   SC instruction unless the rl bit is also set. LR.rl and SC.aq
   instructions are not guaranteed to provide any stronger ordering
   than those with both bits clear, but may result in lower
   performance.

The only behavioral difference is that a failed sc.w/d.aqrl now also
has an .aq effect, which it did not before. That is fine semantically,
because an LR to an overlapping address cannot be reordered past its
related SC.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dan Lustig <dlustig@nvidia.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
---
 arch/riscv/include/asm/atomic.h  | 24 ++++++++----------------
 arch/riscv/include/asm/cmpxchg.h |  6 ++----
 2 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index 34f757dfc8f2..aef8aa9ac4f4 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int
		"0:	lr.w	%[p],  %[c]\n"
		"	beq	%[p],  %[u], 1f\n"
		"	add	%[rc], %[p], %[a]\n"
-		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	sc.w.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		: [a]"r" (a), [u]"r" (u)
@@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a,
		"0:	lr.d	%[p],  %[c]\n"
		"	beq	%[p],  %[u], 1f\n"
		"	add	%[rc], %[p], %[a]\n"
-		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	sc.d.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		: [a]"r" (a), [u]"r" (u)
@@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v)
		"0:	lr.w	%[p],  %[c]\n"
		"	bltz	%[p],  1f\n"
		"	addi	%[rc], %[p], 1\n"
-		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	sc.w.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
@@ -402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v)
		"0:	lr.w	%[p],  %[c]\n"
		"	bgtz	%[p],  1f\n"
		"	addi	%[rc], %[p], -1\n"
-		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	sc.w.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
@@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v)
		"0:	lr.w	%[p],  %[c]\n"
		"	addi	%[rc], %[p], -1\n"
		"	bltz	%[rc], 1f\n"
-		"	sc.w.rl	%[rc], %[rc], %[c]\n"
+		"	sc.w.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
@@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v)
		"0:	lr.d	%[p],  %[c]\n"
		"	bltz	%[p],  1f\n"
		"	addi	%[rc], %[p], 1\n"
-		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	sc.d.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
@@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v)
		"0:	lr.d	%[p],  %[c]\n"
		"	bgtz	%[p],  1f\n"
		"	addi	%[rc], %[p], -1\n"
-		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	sc.d.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
@@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
		"0:	lr.d	%[p],  %[c]\n"
		"	addi	%[rc], %[p], -1\n"
		"	bltz	%[rc], 1f\n"
-		"	sc.d.rl	%[rc], %[rc], %[c]\n"
+		"	sc.d.aqrl	%[rc], %[rc], %[c]\n"
		"	bnez	%[rc], 0b\n"
-		"	fence	rw, rw\n"
		"1:\n"
		: [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter)
		:
diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h
index 1af8db92250b..9269fceb86e0 100644
--- a/arch/riscv/include/asm/cmpxchg.h
+++ b/arch/riscv/include/asm/cmpxchg.h
@@ -307,9 +307,8 @@
	__asm__ __volatile__ (					\
		"0:	lr.w %0, %2\n"				\
		"	bne  %0, %z3, 1f\n"			\
-		"	sc.w.rl %1, %z4, %2\n"			\
+		"	sc.w.aqrl %1, %z4, %2\n"		\
		"	bnez %1, 0b\n"				\
-		"	fence rw, rw\n"				\
		"1:\n"						\
		: "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr)	\
		: "rJ" ((long)__old), "rJ" (__new)		\
@@ -319,9 +318,8 @@
	__asm__ __volatile__ (					\
		"0:	lr.d %0, %2\n"				\
		"	bne %0, %z3, 1f\n"			\
-		"	sc.d.rl %1, %z4, %2\n"			\
+		"	sc.d.aqrl %1, %z4, %2\n"		\
		"	bnez %1, 0b\n"				\
-		"	fence rw, rw\n"				\
		"1:\n"						\
		: "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr)	\
		: "rJ" (__old), "rJ" (__new)			\
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation
  2022-05-05  3:55 ` [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation guoren
@ 2022-05-21 20:46   ` Palmer Dabbelt
  2022-05-22 13:12     ` Guo Ren
  0 siblings, 1 reply; 24+ messages in thread
From: Palmer Dabbelt @ 2022-05-21 20:46 UTC (permalink / raw)
  To: guoren, parri.andrea
  Cc: guoren, Arnd Bergmann, mark.rutland, Will Deacon, peterz,
	boqun.feng, Daniel Lustig, parri.andrea, linux-arch, linux-kernel,
	linux-riscv, guoren

On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The current implementation is the same with 8e86f0b409a4
> ("arm64: atomics: fix use of acquire + release for full barrier
> semantics"). RISC-V could combine acquire and release into the SC
> instructions and it could reduce a fence instruction to gain better
> performance. Here is related description from RISC-V ISA 10.2
> Load-Reserved/Store-Conditional Instructions:
>
>  - .aq:   The LR/SC sequence can be given acquire semantics by
>           setting the aq bit on the LR instruction.
>  - .rl:   The LR/SC sequence can be given release semantics by
>           setting the rl bit on the SC instruction.
>  - .aqrl: Setting the aq bit on the LR instruction, and setting
>           both the aq and the rl bit on the SC instruction makes
>           the LR/SC sequence sequentially consistent, meaning that
>           it cannot be reordered with earlier or later memory
>           operations from the same hart.
>
>    Software should not set the rl bit on an LR instruction unless
>    the aq bit is also set, nor should software set the aq bit on an
>    SC instruction unless the rl bit is also set. LR.rl and SC.aq
>    instructions are not guaranteed to provide any stronger ordering
>    than those with both bits clear, but may result in lower
>    performance.
>
> The only difference is when sc.w/d.aqrl failed, it would cause .aq
> effect than before. But it's okay for sematic because overlap
> address LR couldn't beyond relating SC.

IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to
drift around a bit, so it's possible things have changed since then.
5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences")
describes the issue more specifically; that's when we added these
fences. There have certainly been complaints that these fences are too
heavyweight for the HW to go fast, but IIUC it's the best option we
have given the current set of memory model primitives we implement in
the ISA (ie, there's more in RVWMO but no way to encode that).

The others all look good, though, and as these are really all
independent cleanups I'm going to go ahead and put those three on
for-next.

There's also a bunch of checkpatch errors. The ones about "*" seem
spurious, but the alignment ones aren't. Here's my fixups:

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index 34f757dfc8f2..0bde499fa8bc 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i)
  * versions, while the logical ops only have fetch versions.
  */
 #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix)	\
-static __always_inline							\
-c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,			\
+					   atomic##prefix##_t *v)	\
 {									\
	register c_type ret;						\
	__asm__ __volatile__ (						\
@@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,	\
	: "memory");							\
	return ret;							\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_fetch_##op##_acquire(c_type i,			\
+					   atomic##prefix##_t *v)	\
 {									\
	register c_type ret;						\
	__asm__ __volatile__ (						\
@@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i,	\
	: "memory");							\
	return ret;							\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_fetch_##op##_release(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_fetch_##op##_release(c_type i,			\
+					   atomic##prefix##_t *v)	\
 {									\
	register c_type ret;						\
	__asm__ __volatile__ (						\
@@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i,	\
	: "memory");							\
	return ret;							\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)	\
 {									\
	register c_type ret;						\
	__asm__ __volatile__ (						\
@@ -135,28 +135,28 @@ c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)	\
 }
 
 #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix)	\
-static __always_inline							\
-c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_##op##_return_relaxed(c_type i,			\
+					    atomic##prefix##_t *v)	\
 {									\
-	return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I;	\
+	return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I;	\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_##op##_return_acquire(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_##op##_return_acquire(c_type i,			\
+					    atomic##prefix##_t *v)	\
 {									\
-	return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I;	\
+	return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I;	\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_##op##_return_release(c_type i,		\
-					      atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_##op##_return_release(c_type i,			\
+					    atomic##prefix##_t *v)	\
 {									\
-	return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I;	\
+	return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I;	\
 }									\
-static __always_inline							\
-c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v)	\
+static __always_inline c_type						\
+arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v)	\
 {									\
-	return arch_atomic##prefix##_fetch_##op(i, v) c_op I;		\
+	return arch_atomic##prefix##_fetch_##op(i, v) c_op I;		\
 }
 
 #ifdef CONFIG_GENERIC_ATOMIC64

> [... remainder of the quoted patch trimmed; it repeats the patch
> body already shown above verbatim ...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-05-21 20:46 ` Palmer Dabbelt @ 2022-05-22 13:12 ` Guo Ren 2022-06-02 5:59 ` Palmer Dabbelt 0 siblings, 1 reply; 24+ messages in thread From: Guo Ren @ 2022-05-22 13:12 UTC (permalink / raw) To: Palmer Dabbelt Cc: Andrea Parri, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, Boqun Feng, Daniel Lustig, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Sun, May 22, 2022 at 4:46 AM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote: > > From: Guo Ren <guoren@linux.alibaba.com> > > > > The current implementation is the same with 8e86f0b409a4 > > ("arm64: atomics: fix use of acquire + release for full barrier > > semantics"). RISC-V could combine acquire and release into the SC > > instructions and it could reduce a fence instruction to gain better > > performance. Here is related descriptio from RISC-V ISA 10.2 > > Load-Reserved/Store-Conditional Instructions: > > > > - .aq: The LR/SC sequence can be given acquire semantics by > > setting the aq bit on the LR instruction. > > - .rl: The LR/SC sequence can be given release semantics by > > setting the rl bit on the SC instruction. > > - .aqrl: Setting the aq bit on the LR instruction, and setting > > both the aq and the rl bit on the SC instruction makes > > the LR/SC sequence sequentially consistent, meaning that > > it cannot be reordered with earlier or later memory > > operations from the same hart. > > > > Software should not set the rl bit on an LR instruction unless > > the aq bit is also set, nor should software set the aq bit on an > > SC instruction unless the rl bit is also set. LR.rl and SC.aq > > instructions are not guaranteed to provide any stronger ordering > > than those with both bits clear, but may result in lower > > performance. 
> > > > The only difference is when sc.w/d.aqrl failed, it would cause .aq > > effect than before. But it's okay for sematic because overlap > > address LR couldn't beyond relating SC. > > IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to > drift around a bit, so it's possible things have changed since then. > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") Thx for explaining, that's why I suspected with the sentence in the comment: "which do not give full-ordering with .aqrl" > describes the issue more specifically, that's when we added these > fences. There have certainly been complains that these fences are too > heavyweight for the HW to go fast, but IIUC it's the best option we have Yeah, it would reduce the performance on D1 and our next-generation processor has optimized fence performance a lot. > given the current set of memory model primitives we implement in the > ISA (ie, there's more in RVWMO but no way to encode that). > > The others all look good, though, and as these are really all > independent cleanups I'm going to go ahead and put those three on > for-next. > > There's also a bunch of checkpatch errors. The ones about "*" seem > spurious, but the alignment ones aren't. Here's my fixups: Thx for the fixup. > > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > index 34f757dfc8f2..0bde499fa8bc 100644 > --- a/arch/riscv/include/asm/atomic.h > +++ b/arch/riscv/include/asm/atomic.h > @@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i) > * versions, while the logical ops only have fetch versions.
> */ > #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix) \ > -static __always_inline \ > -c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > register c_type ret; \ > __asm__ __volatile__ ( \ > @@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > : "memory"); \ > return ret; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > register c_type ret; \ > __asm__ __volatile__ ( \ > @@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > : "memory"); \ > return ret; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > register c_type ret; \ > __asm__ __volatile__ ( \ > @@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > : "memory"); \ > return ret; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > { \ > register c_type ret; \ > __asm__ __volatile__ ( \ > @@ -135,28 +135,28 @@ c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > } > > #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix) \ > -static __always_inline \ > -c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > 
+arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > - return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > + return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > - return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > + return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_##op##_return_release(c_type i, \ > - atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_##op##_return_release(c_type i, \ > + atomic##prefix##_t *v) \ > { \ > - return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > + return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > } \ > -static __always_inline \ > -c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > +static __always_inline c_type \ > +arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > { \ > - return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > + return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > } > > #ifdef CONFIG_GENERIC_ATOMIC64 > > > > > > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > > Signed-off-by: Guo Ren <guoren@kernel.org> > > Cc: Palmer Dabbelt <palmer@dabbelt.com> > > Cc: Mark Rutland <mark.rutland@arm.com> > > Cc: Dan Lustig <dlustig@nvidia.com> > > Cc: Andrea Parri <parri.andrea@gmail.com> > > --- > > arch/riscv/include/asm/atomic.h | 24 ++++++++---------------- > > arch/riscv/include/asm/cmpxchg.h | 6 ++---- > > 2 files changed, 10 insertions(+), 20 deletions(-) > > > > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > > index 
34f757dfc8f2..aef8aa9ac4f4 100644 > > --- a/arch/riscv/include/asm/atomic.h > > +++ b/arch/riscv/include/asm/atomic.h > > @@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int > > "0: lr.w %[p], %[c]\n" > > " beq %[p], %[u], 1f\n" > > " add %[rc], %[p], %[a]\n" > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : [a]"r" (a), [u]"r" (u) > > @@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, > > "0: lr.d %[p], %[c]\n" > > " beq %[p], %[u], 1f\n" > > " add %[rc], %[p], %[a]\n" > > - " sc.d.rl %[rc], %[rc], %[c]\n" > > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : [a]"r" (a), [u]"r" (u) > > @@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v) > > "0: lr.w %[p], %[c]\n" > > " bltz %[p], 1f\n" > > " addi %[rc], %[p], 1\n" > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > @@ -402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v) > > "0: lr.w %[p], %[c]\n" > > " bgtz %[p], 1f\n" > > " addi %[rc], %[p], -1\n" > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > @@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v) > > "0: lr.w %[p], %[c]\n" > > " addi %[rc], %[p], -1\n" > > " bltz %[rc], 1f\n" > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence 
rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > @@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v) > > "0: lr.d %[p], %[c]\n" > > " bltz %[p], 1f\n" > > " addi %[rc], %[p], 1\n" > > - " sc.d.rl %[rc], %[rc], %[c]\n" > > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > @@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v) > > "0: lr.d %[p], %[c]\n" > > " bgtz %[p], 1f\n" > > " addi %[rc], %[p], -1\n" > > - " sc.d.rl %[rc], %[rc], %[c]\n" > > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > @@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v) > > "0: lr.d %[p], %[c]\n" > > " addi %[rc], %[p], -1\n" > > " bltz %[rc], 1f\n" > > - " sc.d.rl %[rc], %[rc], %[c]\n" > > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > "1:\n" > > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > : > > diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h > > index 1af8db92250b..9269fceb86e0 100644 > > --- a/arch/riscv/include/asm/cmpxchg.h > > +++ b/arch/riscv/include/asm/cmpxchg.h > > @@ -307,9 +307,8 @@ > > __asm__ __volatile__ ( \ > > "0: lr.w %0, %2\n" \ > > " bne %0, %z3, 1f\n" \ > > - " sc.w.rl %1, %z4, %2\n" \ > > + " sc.w.aqrl %1, %z4, %2\n" \ > > " bnez %1, 0b\n" \ > > - " fence rw, rw\n" \ > > "1:\n" \ > > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > : "rJ" ((long)__old), "rJ" (__new) \ > > @@ -319,9 +318,8 @@ > > __asm__ __volatile__ ( \ > > "0: lr.d %0, %2\n" \ > > " bne %0, %z3, 1f\n" \ > > - " sc.d.rl %1, %z4, %2\n" \ > > + " sc.d.aqrl %1, %z4, %2\n" \ > > " bnez %1, 0b\n" \ > > - " fence rw, rw\n" \ > 
> "1:\n" \ > > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > : "rJ" (__old), "rJ" (__new) \ -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-05-22 13:12 ` Guo Ren @ 2022-06-02 5:59 ` Palmer Dabbelt 2022-06-13 11:49 ` Guo Ren 0 siblings, 1 reply; 24+ messages in thread From: Palmer Dabbelt @ 2022-06-02 5:59 UTC (permalink / raw) To: guoren Cc: parri.andrea, Arnd Bergmann, mark.rutland, Will Deacon, peterz, boqun.feng, Daniel Lustig, linux-arch, linux-kernel, linux-riscv, guoren On Sun, 22 May 2022 06:12:56 PDT (-0700), guoren@kernel.org wrote: > On Sun, May 22, 2022 at 4:46 AM Palmer Dabbelt <palmer@dabbelt.com> wrote: >> >> On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote: >> > From: Guo Ren <guoren@linux.alibaba.com> >> > >> > The current implementation is the same with 8e86f0b409a4 >> > ("arm64: atomics: fix use of acquire + release for full barrier >> > semantics"). RISC-V could combine acquire and release into the SC >> > instructions and it could reduce a fence instruction to gain better >> > performance. Here is related descriptio from RISC-V ISA 10.2 >> > Load-Reserved/Store-Conditional Instructions: >> > >> > - .aq: The LR/SC sequence can be given acquire semantics by >> > setting the aq bit on the LR instruction. >> > - .rl: The LR/SC sequence can be given release semantics by >> > setting the rl bit on the SC instruction. >> > - .aqrl: Setting the aq bit on the LR instruction, and setting >> > both the aq and the rl bit on the SC instruction makes >> > the LR/SC sequence sequentially consistent, meaning that >> > it cannot be reordered with earlier or later memory >> > operations from the same hart. >> > >> > Software should not set the rl bit on an LR instruction unless >> > the aq bit is also set, nor should software set the aq bit on an >> > SC instruction unless the rl bit is also set. LR.rl and SC.aq >> > instructions are not guaranteed to provide any stronger ordering >> > than those with both bits clear, but may result in lower >> > performance. 
>> > >> > The only difference is when sc.w/d.aqrl failed, it would cause .aq >> > effect than before. But it's okay for sematic because overlap >> > address LR couldn't beyond relating SC. >> >> IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to >> drift around a bit, so it's possible things have changed since then. >> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > Thx for explaining, that why I suspected with the sentence in the comment: > "which do not give full-ordering with .aqrl" Sorry, I'm not quite sure what you're saying here. My understanding is that this change results in mappings that violate LKMM, based on the rationale in that previous commit. IIUC that all still holds, but I'm not really a memory model person so I frequently miss stuff around there. >> describes the issue more specifically, that's when we added these >> fences. There have certainly been complains that these fences are too >> heavyweight for the HW to go fast, but IIUC it's the best option we have > Yeah, it would reduce the performance on D1 and our next-generation > processor has optimized fence performance a lot. Definitely a bummer that the fences make the HW go slow, but I don't really see any other way to go about this. If you think these mappings are valid for LKMM and RVWMO then we should figure this out, but trying to drop fences to make HW go faster in ways that violate the memory model is going to lead to insanity. I can definitely buy the argument that we have mappings of various application memory models that are difficult to make fast given the ISA encodings of RVWMO we have, but that's not really an issue we can fix in the atomic mappings. >> given the current set of memory model primitives we implement in the >> ISA (ie, there's more in RVWMO but no way to encode that). >> >> The others all look good, though, and as these are really all >> independent cleanups I'm going to go ahead and put those three on >> for-next.
>> >> There's also a bunch of checkpatch errors. The ones about "*" seem >> spurious, but the alignment ones aren't. Here's my fixups: > Thx for the fixup. > >> >> diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h >> index 34f757dfc8f2..0bde499fa8bc 100644 >> --- a/arch/riscv/include/asm/atomic.h >> +++ b/arch/riscv/include/asm/atomic.h >> @@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i) >> * versions, while the logical ops only have fetch versions. >> */ >> #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix) \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> register c_type ret; \ >> __asm__ __volatile__ ( \ >> @@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ >> : "memory"); \ >> return ret; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> register c_type ret; \ >> __asm__ __volatile__ ( \ >> @@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ >> : "memory"); \ >> return ret; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_fetch_##op##_release(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> register c_type ret; \ >> __asm__ __volatile__ ( \ >> @@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ >> : "memory"); \ >> return ret; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ >> +static __always_inline 
c_type \ >> +arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ >> { \ >> register c_type ret; \ >> __asm__ __volatile__ ( \ >> @@ -135,28 +135,28 @@ c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ >> } >> >> #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix) \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> - return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ >> + return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_##op##_return_acquire(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_##op##_return_acquire(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> - return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ >> + return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_##op##_return_release(c_type i, \ >> - atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_##op##_return_release(c_type i, \ >> + atomic##prefix##_t *v) \ >> { \ >> - return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ >> + return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ >> } \ >> -static __always_inline \ >> -c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ >> +static __always_inline c_type \ >> +arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ >> { \ >> - return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ >> + return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ >> } >> >> #ifdef CONFIG_GENERIC_ATOMIC64 >> >> >> > >> > Signed-off-by: Guo Ren 
<guoren@linux.alibaba.com> >> > Signed-off-by: Guo Ren <guoren@kernel.org> >> > Cc: Palmer Dabbelt <palmer@dabbelt.com> >> > Cc: Mark Rutland <mark.rutland@arm.com> >> > Cc: Dan Lustig <dlustig@nvidia.com> >> > Cc: Andrea Parri <parri.andrea@gmail.com> >> > --- >> > arch/riscv/include/asm/atomic.h | 24 ++++++++---------------- >> > arch/riscv/include/asm/cmpxchg.h | 6 ++---- >> > 2 files changed, 10 insertions(+), 20 deletions(-) >> > >> > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h >> > index 34f757dfc8f2..aef8aa9ac4f4 100644 >> > --- a/arch/riscv/include/asm/atomic.h >> > +++ b/arch/riscv/include/asm/atomic.h >> > @@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int >> > "0: lr.w %[p], %[c]\n" >> > " beq %[p], %[u], 1f\n" >> > " add %[rc], %[p], %[a]\n" >> > - " sc.w.rl %[rc], %[rc], %[c]\n" >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : [a]"r" (a), [u]"r" (u) >> > @@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, >> > "0: lr.d %[p], %[c]\n" >> > " beq %[p], %[u], 1f\n" >> > " add %[rc], %[p], %[a]\n" >> > - " sc.d.rl %[rc], %[rc], %[c]\n" >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : [a]"r" (a), [u]"r" (u) >> > @@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v) >> > "0: lr.w %[p], %[c]\n" >> > " bltz %[p], 1f\n" >> > " addi %[rc], %[p], 1\n" >> > - " sc.w.rl %[rc], %[rc], %[c]\n" >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > @@ -402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v) >> 
> "0: lr.w %[p], %[c]\n" >> > " bgtz %[p], 1f\n" >> > " addi %[rc], %[p], -1\n" >> > - " sc.w.rl %[rc], %[rc], %[c]\n" >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > @@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v) >> > "0: lr.w %[p], %[c]\n" >> > " addi %[rc], %[p], -1\n" >> > " bltz %[rc], 1f\n" >> > - " sc.w.rl %[rc], %[rc], %[c]\n" >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > @@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v) >> > "0: lr.d %[p], %[c]\n" >> > " bltz %[p], 1f\n" >> > " addi %[rc], %[p], 1\n" >> > - " sc.d.rl %[rc], %[rc], %[c]\n" >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > @@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v) >> > "0: lr.d %[p], %[c]\n" >> > " bgtz %[p], 1f\n" >> > " addi %[rc], %[p], -1\n" >> > - " sc.d.rl %[rc], %[rc], %[c]\n" >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > @@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v) >> > "0: lr.d %[p], %[c]\n" >> > " addi %[rc], %[p], -1\n" >> > " bltz %[rc], 1f\n" >> > - " sc.d.rl %[rc], %[rc], %[c]\n" >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" >> > " bnez %[rc], 0b\n" >> > - " fence rw, rw\n" >> > "1:\n" >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) >> > : >> > diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h >> > index 1af8db92250b..9269fceb86e0 100644 >> > --- 
a/arch/riscv/include/asm/cmpxchg.h >> > +++ b/arch/riscv/include/asm/cmpxchg.h >> > @@ -307,9 +307,8 @@ >> > __asm__ __volatile__ ( \ >> > "0: lr.w %0, %2\n" \ >> > " bne %0, %z3, 1f\n" \ >> > - " sc.w.rl %1, %z4, %2\n" \ >> > + " sc.w.aqrl %1, %z4, %2\n" \ >> > " bnez %1, 0b\n" \ >> > - " fence rw, rw\n" \ >> > "1:\n" \ >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ >> > : "rJ" ((long)__old), "rJ" (__new) \ >> > @@ -319,9 +318,8 @@ >> > __asm__ __volatile__ ( \ >> > "0: lr.d %0, %2\n" \ >> > " bne %0, %z3, 1f\n" \ >> > - " sc.d.rl %1, %z4, %2\n" \ >> > + " sc.d.aqrl %1, %z4, %2\n" \ >> > " bnez %1, 0b\n" \ >> > - " fence rw, rw\n" \ >> > "1:\n" \ >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ >> > : "rJ" (__old), "rJ" (__new) \ ^ permalink raw reply [flat|nested] 24+ messages in thread
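One more aid for following the diffs: arch_atomic_dec_if_positive(), repeatedly quoted above, decrements the counter only when the result would stay non-negative and returns the decremented value (so a negative return means nothing was stored). A hypothetical C11 model of that "lr.w; addi rc, p, -1; bltz rc, 1f; sc.w.aqrl" loop, with the same caveats as before about names and seq_cst being stand-ins:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative model of dec_if_positive; not the kernel's asm. */
static int dec_if_positive(_Atomic int *v)
{
	int prev = atomic_load_explicit(v, memory_order_relaxed);
	int rc;

	do {
		rc = prev - 1;		/* "addi %[rc], %[p], -1" */
		if (rc < 0)		/* "bltz %[rc], 1f": bail, no store */
			break;
	} while (!atomic_compare_exchange_weak_explicit(v, &prev, rc,
							memory_order_seq_cst,
							memory_order_relaxed));
	return rc;			/* new value, possibly negative */
}
```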
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-02 5:59 ` Palmer Dabbelt @ 2022-06-13 11:49 ` Guo Ren 2022-06-14 11:03 ` Andrea Parri 0 siblings, 1 reply; 24+ messages in thread From: Guo Ren @ 2022-06-13 11:49 UTC (permalink / raw) To: Palmer Dabbelt, Daniel Lustig Cc: Andrea Parri, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, Boqun Feng, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Thu, Jun 2, 2022 at 1:59 PM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > On Sun, 22 May 2022 06:12:56 PDT (-0700), guoren@kernel.org wrote: > > On Sun, May 22, 2022 at 4:46 AM Palmer Dabbelt <palmer@dabbelt.com> wrote: > >> > >> On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote: > >> > From: Guo Ren <guoren@linux.alibaba.com> > >> > > >> > The current implementation is the same with 8e86f0b409a4 > >> > ("arm64: atomics: fix use of acquire + release for full barrier > >> > semantics"). RISC-V could combine acquire and release into the SC > >> > instructions and it could reduce a fence instruction to gain better > >> > performance. Here is related descriptio from RISC-V ISA 10.2 > >> > Load-Reserved/Store-Conditional Instructions: > >> > > >> > - .aq: The LR/SC sequence can be given acquire semantics by > >> > setting the aq bit on the LR instruction. > >> > - .rl: The LR/SC sequence can be given release semantics by > >> > setting the rl bit on the SC instruction. > >> > - .aqrl: Setting the aq bit on the LR instruction, and setting > >> > both the aq and the rl bit on the SC instruction makes > >> > the LR/SC sequence sequentially consistent, meaning that > >> > it cannot be reordered with earlier or later memory > >> > operations from the same hart. > >> > > >> > Software should not set the rl bit on an LR instruction unless > >> > the aq bit is also set, nor should software set the aq bit on an > >> > SC instruction unless the rl bit is also set. 
LR.rl and SC.aq > >> > instructions are not guaranteed to provide any stronger ordering > >> > than those with both bits clear, but may result in lower > >> > performance. > >> > > >> > The only difference is when sc.w/d.aqrl failed, it would cause .aq > >> > effect than before. But it's okay for sematic because overlap > >> > address LR couldn't beyond relating SC. > >> > >> IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to > >> drift around a bit, so it's possible things have changed since then. > >> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > Thx for explaining, that why I suspected with the sentence in the comment: > > "which do not give full-ordering with .aqrl" > > Sorry, I'm not quite sure what you're saying here. My understanding is > that this change results in mappings that violate LKMM, based on the > rationale in that previous commit. IIUC that all still holds, but I'm > not really a memory model person so I frequently miss stuff around > there. 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") is about fixing up a wrong spinlock/unlock implementation and is not related to this patch. Actually, sc.w.aqrl is very strong and the same as: "fence rw, rw; sc.w; fence rw, rw" So "which do not give full-ordering with .aqrl" is not written in the RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > >> describes the issue more specifically, that's when we added these > >> fences. There have certainly been complains that these fences are too > >> heavyweight for the HW to go fast, but IIUC it's the best option we have > > Yeah, it would reduce the performance on D1 and our next-generation > > processor has optimized fence performance a lot. > > Definately a bummer that the fences make the HW go slow, but I don't > really see any other way to go about this.
If you think these mappings > are valid for LKMM and RVWMO then we should figure this out, but trying > to drop fences to make HW go faster in ways that violate the memory > model is going to lead to insanity. Actually, this patch is okay with the ISA spec, and Dan also thought it was valid. ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw ------ > 2. Using .aqrl to replace the fence rw, rw is okay to ISA manual, > right? And reducing a fence instruction to gain better performance: > "0: lr.w %[p], %[c]\n" > " sub %[rc], %[p], %[o]\n" > " bltz %[rc], 1f\n". > - " sc.w.rl %[rc], %[rc], %[c]\n" > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > " bnez %[rc], 0b\n" > - " fence rw, rw\n" Yes, using .aqrl is valid. Dan ------ > > I can definately buy the argument that we have mappings of various > application memory models that are difficult to make fast given the ISA > encodings of RVWMO we have, but that's not really an issue we can fix in > the atomic mappings. > > >> given the current set of memory model primitives we implement in the > >> ISA (ie, there's more in RVWMO but no way to encode that). > >> > >> The others all look good, though, and as these are really all > >> independent cleanups I'm going to go ahead and put those three on > >> for-next. > >> > >> There's also a bunch of checkpatch errors. The ones about "*" seem > >> spurious, but the alignment ones aren't. Here's my fixups: > > Thx for the fixup. > > > >> > >> diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > >> index 34f757dfc8f2..0bde499fa8bc 100644 > >> --- a/arch/riscv/include/asm/atomic.h > >> +++ b/arch/riscv/include/asm/atomic.h > >> @@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i) > >> * versions, while the logical ops only have fetch versions. 
> >> */ > >> #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix) \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> register c_type ret; \ > >> __asm__ __volatile__ ( \ > >> @@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > >> : "memory"); \ > >> return ret; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> register c_type ret; \ > >> __asm__ __volatile__ ( \ > >> @@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > >> : "memory"); \ > >> return ret; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> register c_type ret; \ > >> __asm__ __volatile__ ( \ > >> @@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > >> : "memory"); \ > >> return ret; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > >> { \ > >> register c_type ret; \ > >> __asm__ __volatile__ ( \ > >> @@ -135,28 +135,28 @@ c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > >> } > >> > >> #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix) \ > >> -static __always_inline \ > >> -c_type 
arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> - return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > >> + return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> - return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > >> + return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_##op##_return_release(c_type i, \ > >> - atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_##op##_return_release(c_type i, \ > >> + atomic##prefix##_t *v) \ > >> { \ > >> - return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > >> + return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > >> } \ > >> -static __always_inline \ > >> -c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > >> +static __always_inline c_type \ > >> +arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > >> { \ > >> - return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > >> + return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > >> } > >> > >> #ifdef CONFIG_GENERIC_ATOMIC64 > >> > >> > >> > > >> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > >> > Signed-off-by: Guo Ren <guoren@kernel.org> > >> > Cc: Palmer Dabbelt <palmer@dabbelt.com> > >> > Cc: Mark Rutland <mark.rutland@arm.com> > >> > Cc: Dan Lustig <dlustig@nvidia.com> > >> > Cc: Andrea Parri <parri.andrea@gmail.com> > >> > --- > >> > 
arch/riscv/include/asm/atomic.h | 24 ++++++++---------------- > >> > arch/riscv/include/asm/cmpxchg.h | 6 ++---- > >> > 2 files changed, 10 insertions(+), 20 deletions(-) > >> > > >> > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > >> > index 34f757dfc8f2..aef8aa9ac4f4 100644 > >> > --- a/arch/riscv/include/asm/atomic.h > >> > +++ b/arch/riscv/include/asm/atomic.h > >> > @@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int > >> > "0: lr.w %[p], %[c]\n" > >> > " beq %[p], %[u], 1f\n" > >> > " add %[rc], %[p], %[a]\n" > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : [a]"r" (a), [u]"r" (u) > >> > @@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, > >> > "0: lr.d %[p], %[c]\n" > >> > " beq %[p], %[u], 1f\n" > >> > " add %[rc], %[p], %[a]\n" > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : [a]"r" (a), [u]"r" (u) > >> > @@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v) > >> > "0: lr.w %[p], %[c]\n" > >> > " bltz %[p], 1f\n" > >> > " addi %[rc], %[p], 1\n" > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > @@ -402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v) > >> > "0: lr.w %[p], %[c]\n" > >> > " bgtz %[p], 1f\n" > >> > " addi %[rc], %[p], -1\n" > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez 
%[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > @@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v) > >> > "0: lr.w %[p], %[c]\n" > >> > " addi %[rc], %[p], -1\n" > >> > " bltz %[rc], 1f\n" > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > @@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v) > >> > "0: lr.d %[p], %[c]\n" > >> > " bltz %[p], 1f\n" > >> > " addi %[rc], %[p], 1\n" > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > @@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v) > >> > "0: lr.d %[p], %[c]\n" > >> > " bgtz %[p], 1f\n" > >> > " addi %[rc], %[p], -1\n" > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > @@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v) > >> > "0: lr.d %[p], %[c]\n" > >> > " addi %[rc], %[p], -1\n" > >> > " bltz %[rc], 1f\n" > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > >> > " bnez %[rc], 0b\n" > >> > - " fence rw, rw\n" > >> > "1:\n" > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > >> > : > >> > diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h > >> > index 1af8db92250b..9269fceb86e0 100644 > >> > --- a/arch/riscv/include/asm/cmpxchg.h > >> > +++ 
b/arch/riscv/include/asm/cmpxchg.h > >> > @@ -307,9 +307,8 @@ > >> > __asm__ __volatile__ ( \ > >> > "0: lr.w %0, %2\n" \ > >> > " bne %0, %z3, 1f\n" \ > >> > - " sc.w.rl %1, %z4, %2\n" \ > >> > + " sc.w.aqrl %1, %z4, %2\n" \ > >> > " bnez %1, 0b\n" \ > >> > - " fence rw, rw\n" \ > >> > "1:\n" \ > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > >> > : "rJ" ((long)__old), "rJ" (__new) \ > >> > @@ -319,9 +318,8 @@ > >> > __asm__ __volatile__ ( \ > >> > "0: lr.d %0, %2\n" \ > >> > " bne %0, %z3, 1f\n" \ > >> > - " sc.d.rl %1, %z4, %2\n" \ > >> > + " sc.d.aqrl %1, %z4, %2\n" \ > >> > " bnez %1, 0b\n" \ > >> > - " fence rw, rw\n" \ > >> > "1:\n" \ > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > >> > : "rJ" (__old), "rJ" (__new) \ -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-13 11:49 ` Guo Ren @ 2022-06-14 11:03 ` Andrea Parri 2022-06-23 3:31 ` Boqun Feng 2022-06-24 3:28 ` Guo Ren 0 siblings, 2 replies; 24+ messages in thread From: Andrea Parri @ 2022-06-14 11:03 UTC (permalink / raw) To: Guo Ren Cc: Palmer Dabbelt, Daniel Lustig, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, Boqun Feng, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren Guo, On Mon, Jun 13, 2022 at 07:49:50PM +0800, Guo Ren wrote: > On Thu, Jun 2, 2022 at 1:59 PM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > > > On Sun, 22 May 2022 06:12:56 PDT (-0700), guoren@kernel.org wrote: > > > On Sun, May 22, 2022 at 4:46 AM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > >> > > >> On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote: > > >> > From: Guo Ren <guoren@linux.alibaba.com> > > >> > > > >> > The current implementation is the same with 8e86f0b409a4 > > >> > ("arm64: atomics: fix use of acquire + release for full barrier > > >> > semantics"). RISC-V could combine acquire and release into the SC > > >> > instructions and it could reduce a fence instruction to gain better > > >> > performance. Here is related descriptio from RISC-V ISA 10.2 > > >> > Load-Reserved/Store-Conditional Instructions: > > >> > > > >> > - .aq: The LR/SC sequence can be given acquire semantics by > > >> > setting the aq bit on the LR instruction. > > >> > - .rl: The LR/SC sequence can be given release semantics by > > >> > setting the rl bit on the SC instruction. > > >> > - .aqrl: Setting the aq bit on the LR instruction, and setting > > >> > both the aq and the rl bit on the SC instruction makes > > >> > the LR/SC sequence sequentially consistent, meaning that > > >> > it cannot be reordered with earlier or later memory > > >> > operations from the same hart. 
> > >> > > > >> > Software should not set the rl bit on an LR instruction unless > > >> > the aq bit is also set, nor should software set the aq bit on an > > >> > SC instruction unless the rl bit is also set. LR.rl and SC.aq > > >> > instructions are not guaranteed to provide any stronger ordering > > >> > than those with both bits clear, but may result in lower > > >> > performance. > > >> > > > >> > The only difference is when sc.w/d.aqrl failed, it would cause .aq > > >> > effect than before. But it's okay for sematic because overlap > > >> > address LR couldn't beyond relating SC. > > >> > > >> IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to > > >> drift around a bit, so it's possible things have changed since then. > > >> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > > Thx for explaining, that why I suspected with the sentence in the comment: > > > "which do not give full-ordering with .aqrl" > > > > Sorry, I'm not quite sure what you're saying here. My understanding is > > that this change results in mappings that violate LKMM, based on the > > rationale in that previous commit. IIUC that all still holds, but I'm > > not really a memory model person so I frequently miss stuff around > > there. > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > is about fixup wrong spinlock/unlock implementation and not relate to > this patch. No. The commit in question is evidence of the fact that the changes you are presenting here (as an optimization) were buggy/incorrect at the time in which that commit was worked out. > Actually, sc.w.aqrl is very strong and the same with: > fence rw, rw > sc.w > fence rw,rw > > So "which do not give full-ordering with .aqrl" is not writen in > RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > > > > >> describes the issue more specifically, that's when we added these > > >> fences. 
There have certainly been complains that these fences are too > > >> heavyweight for the HW to go fast, but IIUC it's the best option we have > > > Yeah, it would reduce the performance on D1 and our next-generation > > > processor has optimized fence performance a lot. > > > > Definately a bummer that the fences make the HW go slow, but I don't > > really see any other way to go about this. If you think these mappings > > are valid for LKMM and RVWMO then we should figure this out, but trying > > to drop fences to make HW go faster in ways that violate the memory > > model is going to lead to insanity. > Actually, this patch is okay with the ISA spec, and Dan also thought > it was valid. > > ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw "Thoughts" on this regard have _changed_. Please compare that quote with, e.g. https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ So here's a suggestion: Reviewers of your patches have asked: How come that code we used to consider as buggy is now considered "an optimization" (correct)? Denying the evidence or going around it is not making their job (and this upstreaming) easier, so why don't you address it? Take time to review previous works and discussions in this area, understand them, and integrate such knowledge in future submissions. Andrea > ------ > > 2. Using .aqrl to replace the fence rw, rw is okay to ISA manual, > > right? And reducing a fence instruction to gain better performance: > > "0: lr.w %[p], %[c]\n" > > " sub %[rc], %[p], %[o]\n" > > " bltz %[rc], 1f\n". > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > " bnez %[rc], 0b\n" > > - " fence rw, rw\n" > > Yes, using .aqrl is valid. 
> > Dan > ------ > > > > > > I can definately buy the argument that we have mappings of various > > application memory models that are difficult to make fast given the ISA > > encodings of RVWMO we have, but that's not really an issue we can fix in > > the atomic mappings. > > > > >> given the current set of memory model primitives we implement in the > > >> ISA (ie, there's more in RVWMO but no way to encode that). > > >> > > >> The others all look good, though, and as these are really all > > >> independent cleanups I'm going to go ahead and put those three on > > >> for-next. > > >> > > >> There's also a bunch of checkpatch errors. The ones about "*" seem > > >> spurious, but the alignment ones aren't. Here's my fixups: > > > Thx for the fixup. > > > > > >> > > >> diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > > >> index 34f757dfc8f2..0bde499fa8bc 100644 > > >> --- a/arch/riscv/include/asm/atomic.h > > >> +++ b/arch/riscv/include/asm/atomic.h > > >> @@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i) > > >> * versions, while the logical ops only have fetch versions. 
> > >> */ > > >> #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix) \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> register c_type ret; \ > > >> __asm__ __volatile__ ( \ > > >> @@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > >> : "memory"); \ > > >> return ret; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> register c_type ret; \ > > >> __asm__ __volatile__ ( \ > > >> @@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > >> : "memory"); \ > > >> return ret; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> register c_type ret; \ > > >> __asm__ __volatile__ ( \ > > >> @@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > >> : "memory"); \ > > >> return ret; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > >> { \ > > >> register c_type ret; \ > > >> __asm__ __volatile__ ( \ > > >> @@ -135,28 +135,28 @@ c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > >> } > > >> > > >> #define 
ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix) \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> - return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > > >> + return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> - return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > > >> + return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_##op##_return_release(c_type i, \ > > >> - atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_##op##_return_release(c_type i, \ > > >> + atomic##prefix##_t *v) \ > > >> { \ > > >> - return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > > >> + return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > > >> } \ > > >> -static __always_inline \ > > >> -c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > > >> +static __always_inline c_type \ > > >> +arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > > >> { \ > > >> - return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > > >> + return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > > >> } > > >> > > >> #ifdef CONFIG_GENERIC_ATOMIC64 > > >> > > >> > > >> > > > >> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > > >> > Signed-off-by: Guo Ren <guoren@kernel.org> > > >> > Cc: Palmer 
Dabbelt <palmer@dabbelt.com> > > >> > Cc: Mark Rutland <mark.rutland@arm.com> > > >> > Cc: Dan Lustig <dlustig@nvidia.com> > > >> > Cc: Andrea Parri <parri.andrea@gmail.com> > > >> > --- > > >> > arch/riscv/include/asm/atomic.h | 24 ++++++++---------------- > > >> > arch/riscv/include/asm/cmpxchg.h | 6 ++---- > > >> > 2 files changed, 10 insertions(+), 20 deletions(-) > > >> > > > >> > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > > >> > index 34f757dfc8f2..aef8aa9ac4f4 100644 > > >> > --- a/arch/riscv/include/asm/atomic.h > > >> > +++ b/arch/riscv/include/asm/atomic.h > > >> > @@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int > > >> > "0: lr.w %[p], %[c]\n" > > >> > " beq %[p], %[u], 1f\n" > > >> > " add %[rc], %[p], %[a]\n" > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : [a]"r" (a), [u]"r" (u) > > >> > @@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, > > >> > "0: lr.d %[p], %[c]\n" > > >> > " beq %[p], %[u], 1f\n" > > >> > " add %[rc], %[p], %[a]\n" > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : [a]"r" (a), [u]"r" (u) > > >> > @@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v) > > >> > "0: lr.w %[p], %[c]\n" > > >> > " bltz %[p], 1f\n" > > >> > " addi %[rc], %[p], 1\n" > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > @@ 
-402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v) > > >> > "0: lr.w %[p], %[c]\n" > > >> > " bgtz %[p], 1f\n" > > >> > " addi %[rc], %[p], -1\n" > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > @@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v) > > >> > "0: lr.w %[p], %[c]\n" > > >> > " addi %[rc], %[p], -1\n" > > >> > " bltz %[rc], 1f\n" > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > @@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v) > > >> > "0: lr.d %[p], %[c]\n" > > >> > " bltz %[p], 1f\n" > > >> > " addi %[rc], %[p], 1\n" > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > @@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v) > > >> > "0: lr.d %[p], %[c]\n" > > >> > " bgtz %[p], 1f\n" > > >> > " addi %[rc], %[p], -1\n" > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > @@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v) > > >> > "0: lr.d %[p], %[c]\n" > > >> > " addi %[rc], %[p], -1\n" > > >> > " bltz %[rc], 1f\n" > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > >> > + " sc.d.aqrl 
%[rc], %[rc], %[c]\n" > > >> > " bnez %[rc], 0b\n" > > >> > - " fence rw, rw\n" > > >> > "1:\n" > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > >> > : > > >> > diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h > > >> > index 1af8db92250b..9269fceb86e0 100644 > > >> > --- a/arch/riscv/include/asm/cmpxchg.h > > >> > +++ b/arch/riscv/include/asm/cmpxchg.h > > >> > @@ -307,9 +307,8 @@ > > >> > __asm__ __volatile__ ( \ > > >> > "0: lr.w %0, %2\n" \ > > >> > " bne %0, %z3, 1f\n" \ > > >> > - " sc.w.rl %1, %z4, %2\n" \ > > >> > + " sc.w.aqrl %1, %z4, %2\n" \ > > >> > " bnez %1, 0b\n" \ > > >> > - " fence rw, rw\n" \ > > >> > "1:\n" \ > > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > >> > : "rJ" ((long)__old), "rJ" (__new) \ > > >> > @@ -319,9 +318,8 @@ > > >> > __asm__ __volatile__ ( \ > > >> > "0: lr.d %0, %2\n" \ > > >> > " bne %0, %z3, 1f\n" \ > > >> > - " sc.d.rl %1, %z4, %2\n" \ > > >> > + " sc.d.aqrl %1, %z4, %2\n" \ > > >> > " bnez %1, 0b\n" \ > > >> > - " fence rw, rw\n" \ > > >> > "1:\n" \ > > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > >> > : "rJ" (__old), "rJ" (__new) \ > > > > -- > Best Regards > Guo Ren > > ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-14 11:03 ` Andrea Parri @ 2022-06-23 3:31 ` Boqun Feng 2022-06-23 17:09 ` Dan Lustig 2022-06-24 3:28 ` Guo Ren 1 sibling, 1 reply; 24+ messages in thread From: Boqun Feng @ 2022-06-23 3:31 UTC (permalink / raw) To: Andrea Parri Cc: Guo Ren, Palmer Dabbelt, Daniel Lustig, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren Hi, On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: [...] > > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > is about fixup wrong spinlock/unlock implementation and not relate to > > this patch. > > No. The commit in question is evidence of the fact that the changes > you are presenting here (as an optimization) were buggy/incorrect at > the time in which that commit was worked out. > > > > Actually, sc.w.aqrl is very strong and the same with: > > fence rw, rw > > sc.w > > fence rw,rw > > > > So "which do not give full-ordering with .aqrl" is not writen in > > RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > > > > > > > >> describes the issue more specifically, that's when we added these > > > >> fences. There have certainly been complains that these fences are too > > > >> heavyweight for the HW to go fast, but IIUC it's the best option we have > > > > Yeah, it would reduce the performance on D1 and our next-generation > > > > processor has optimized fence performance a lot. > > > > > > Definately a bummer that the fences make the HW go slow, but I don't > > > really see any other way to go about this. If you think these mappings > > > are valid for LKMM and RVWMO then we should figure this out, but trying > > > to drop fences to make HW go faster in ways that violate the memory > > > model is going to lead to insanity. > > Actually, this patch is okay with the ISA spec, and Dan also thought > > it was valid. 
> > > > ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > > "Thoughts" on this regard have _changed_. Please compare that quote > with, e.g. > > https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > > So here's a suggestion: > > Reviewers of your patches have asked: How come that code we used to > consider as buggy is now considered "an optimization" (correct)? > > Denying the evidence or going around it is not making their job (and > this upstreaming) easier, so why don't you address it? Take time to > review previous works and discussions in this area, understand them, > and integrate such knowledge in future submissions. > I agree with Andrea. And I actually took a look into this, and I think I find some explanation. There are two versions of RISV memory model here: Model 2017: released at Dec 1, 2017 as a draft https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ Model 2018: released at May 2, 2018 https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ Noted that previous conversation about commit 5ce6c1f3535f happened at March 2018. So the timeline is roughly: Model 2017 -> commit 5ce6c1f3535f -> Model 2018 And in the email thread of Model 2018, the commit related to model changes also got mentioned: https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a in that commit, we can see the changes related to sc.aqrl are: to have occurred between the LR and a successful SC. The LR/SC sequence can be given acquire semantics by setting the {\em aq} bit on -the SC instruction. The LR/SC sequence can be given release semantics -by setting the {\em rl} bit on the LR instruction. Setting both {\em - aq} and {\em rl} bits on the LR instruction, and setting the {\em - aq} bit on the SC instruction makes the LR/SC sequence sequentially -consistent with respect to other sequentially consistent atomic -operations. 
+the LR instruction. The LR/SC sequence can be given release semantics +by setting the {\em rl} bit on the SC instruction. Setting the {\em + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em + rl} bit on the SC instruction makes the LR/SC sequence sequentially +consistent, meaning that it cannot be reordered with earlier or +later memory operations from the same hart. note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered against "earlier or later memory operations from the same hart", and this statement was not in Model 2017. So my understanding of the story is that at some point between March and May 2018, RISV memory model folks decided to add this rule, which does look more consistent with other parts of the model and is useful. And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered barrier ;-) Now if my understanding is correct, to move forward, it's better that 1) this patch gets resend with the above information (better rewording a bit), and 2) gets an Acked-by from Dan to confirm this is a correct history ;-) Regards, Boqun > Andrea > > [...] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-23 3:31 ` Boqun Feng @ 2022-06-23 17:09 ` Dan Lustig 2022-06-23 17:55 ` Boqun Feng 2022-06-25 5:29 ` Guo Ren 0 siblings, 2 replies; 24+ messages in thread From: Dan Lustig @ 2022-06-23 17:09 UTC (permalink / raw) To: Boqun Feng, Andrea Parri Cc: Guo Ren, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On 6/22/2022 11:31 PM, Boqun Feng wrote: > Hi, > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > [...] >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") >>> is about fixup wrong spinlock/unlock implementation and not relate to >>> this patch. >> >> No. The commit in question is evidence of the fact that the changes >> you are presenting here (as an optimization) were buggy/incorrect at >> the time in which that commit was worked out. >> >> >>> Actually, sc.w.aqrl is very strong and the same with: >>> fence rw, rw >>> sc.w >>> fence rw,rw >>> >>> So "which do not give full-ordering with .aqrl" is not writen in >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. >>> >>>> >>>>>> describes the issue more specifically, that's when we added these >>>>>> fences. There have certainly been complains that these fences are too >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have >>>>> Yeah, it would reduce the performance on D1 and our next-generation >>>>> processor has optimized fence performance a lot. >>>> >>>> Definately a bummer that the fences make the HW go slow, but I don't >>>> really see any other way to go about this. If you think these mappings >>>> are valid for LKMM and RVWMO then we should figure this out, but trying >>>> to drop fences to make HW go faster in ways that violate the memory >>>> model is going to lead to insanity. 
>>> Actually, this patch is okay with the ISA spec, and Dan also thought >>> it was valid. >>> >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw >> >> "Thoughts" on this regard have _changed_. Please compare that quote >> with, e.g. >> >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ >> >> So here's a suggestion: >> >> Reviewers of your patches have asked: How come that code we used to >> consider as buggy is now considered "an optimization" (correct)? >> >> Denying the evidence or going around it is not making their job (and >> this upstreaming) easier, so why don't you address it? Take time to >> review previous works and discussions in this area, understand them, >> and integrate such knowledge in future submissions. >> > > I agree with Andrea. > > And I actually took a look into this, and I think I find some > explanation. There are two versions of RISV memory model here: > > Model 2017: released at Dec 1, 2017 as a draft > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > Model 2018: released at May 2, 2018 > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > Noted that previous conversation about commit 5ce6c1f3535f happened at > March 2018. So the timeline is roughly: > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > And in the email thread of Model 2018, the commit related to model > changes also got mentioned: > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > in that commit, we can see the changes related to sc.aqrl are: > > to have occurred between the LR and a successful SC. The LR/SC > sequence can be given acquire semantics by setting the {\em aq} bit on > -the SC instruction. The LR/SC sequence can be given release semantics > -by setting the {\em rl} bit on the LR instruction. 
Setting both {\em > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > -consistent with respect to other sequentially consistent atomic > -operations. > +the LR instruction. The LR/SC sequence can be given release semantics > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > +consistent, meaning that it cannot be reordered with earlier or > +later memory operations from the same hart. > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > against "earlier or later memory operations from the same hart", and > this statement was not in Model 2017. > > So my understanding of the story is that at some point between March and > May 2018, RISV memory model folks decided to add this rule, which does > look more consistent with other parts of the model and is useful. > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > barrier ;-) > > Now if my understanding is correct, to move forward, it's better that 1) > this patch gets resend with the above information (better rewording a > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > history ;-) I'm a bit lost as to why digging into RISC-V mailing list history is relevant here...what's relevant is what was ratified in the RVWMO chapter of the RISC-V spec, and whether the code you're proposing is the most optimized code that is correct wrt RVWMO. Is your claim that the code you're proposing to fix was based on a pre-RVWMO RISC-V memory model definition, and you're updating it to be more RVWMO-compliant? Dan > Regards, > Boqun > >> Andrea >> >> > [...] ^ permalink raw reply [flat|nested] 24+ messages in thread
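For reference, the fully-ordered mapping discussed in this exchange can be sketched as an LR/SC loop of the following shape (an illustrative sketch only, not code from the patch; register names and the loop structure are assumptions):

```asm
# Illustrative xchg-style LR/SC sequence. Per the 2018 model text
# quoted above, setting .aq on the LR and .aqrl on the SC makes the
# sequence sequentially consistent, i.e. it cannot be reordered with
# earlier or later memory operations from the same hart.
1:  lr.w.aq     t0, (a0)        # load-reserved old value, acquire
    sc.w.aqrl   t1, a1, (a0)    # store-conditional new value, aq+rl
    bnez        t1, 1b          # retry if the reservation was lost
    # t0 now holds the old value; no trailing fence is needed
```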
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-23 17:09 ` Dan Lustig @ 2022-06-23 17:55 ` Boqun Feng 2022-06-23 22:15 ` Palmer Dabbelt 2022-06-24 3:34 ` Guo Ren 2022-06-25 5:29 ` Guo Ren 1 sibling, 2 replies; 24+ messages in thread From: Boqun Feng @ 2022-06-23 17:55 UTC (permalink / raw) To: Dan Lustig Cc: Andrea Parri, Guo Ren, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Thu, Jun 23, 2022 at 01:09:23PM -0400, Dan Lustig wrote: > On 6/22/2022 11:31 PM, Boqun Feng wrote: > > Hi, > > > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > > [...] > >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > >>> is about fixup wrong spinlock/unlock implementation and not relate to > >>> this patch. > >> > >> No. The commit in question is evidence of the fact that the changes > >> you are presenting here (as an optimization) were buggy/incorrect at > >> the time in which that commit was worked out. > >> > >> > >>> Actually, sc.w.aqrl is very strong and the same with: > >>> fence rw, rw > >>> sc.w > >>> fence rw,rw > >>> > >>> So "which do not give full-ordering with .aqrl" is not writen in > >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > >>> > >>>> > >>>>>> describes the issue more specifically, that's when we added these > >>>>>> fences. There have certainly been complains that these fences are too > >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > >>>>> Yeah, it would reduce the performance on D1 and our next-generation > >>>>> processor has optimized fence performance a lot. > >>>> > >>>> Definately a bummer that the fences make the HW go slow, but I don't > >>>> really see any other way to go about this. 
If you think these mappings > >>>> are valid for LKMM and RVWMO then we should figure this out, but trying > >>>> to drop fences to make HW go faster in ways that violate the memory > >>>> model is going to lead to insanity. > >>> Actually, this patch is okay with the ISA spec, and Dan also thought > >>> it was valid. > >>> > >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > >> > >> "Thoughts" on this regard have _changed_. Please compare that quote > >> with, e.g. > >> > >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > >> > >> So here's a suggestion: > >> > >> Reviewers of your patches have asked: How come that code we used to > >> consider as buggy is now considered "an optimization" (correct)? > >> > >> Denying the evidence or going around it is not making their job (and > >> this upstreaming) easier, so why don't you address it? Take time to > >> review previous works and discussions in this area, understand them, > >> and integrate such knowledge in future submissions. > >> > > > > I agree with Andrea. > > > > And I actually took a look into this, and I think I find some > > explanation. There are two versions of RISV memory model here: > > > > Model 2017: released at Dec 1, 2017 as a draft > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > > > Model 2018: released at May 2, 2018 > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > > > Noted that previous conversation about commit 5ce6c1f3535f happened at > > March 2018. 
So the timeline is roughly: > > > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > > > And in the email thread of Model 2018, the commit related to model > > changes also got mentioned: > > > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > > > in that commit, we can see the changes related to sc.aqrl are: > > > > to have occurred between the LR and a successful SC. The LR/SC > > sequence can be given acquire semantics by setting the {\em aq} bit on > > -the SC instruction. The LR/SC sequence can be given release semantics > > -by setting the {\em rl} bit on the LR instruction. Setting both {\em > > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > > -consistent with respect to other sequentially consistent atomic > > -operations. > > +the LR instruction. The LR/SC sequence can be given release semantics > > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > > +consistent, meaning that it cannot be reordered with earlier or > > +later memory operations from the same hart. > > > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > > against "earlier or later memory operations from the same hart", and > > this statement was not in Model 2017. > > > > So my understanding of the story is that at some point between March and > > May 2018, RISV memory model folks decided to add this rule, which does > > look more consistent with other parts of the model and is useful. 
> > > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > > barrier ;-) > > > > Now if my understanding is correct, to move forward, it's better that 1) > > this patch gets resend with the above information (better rewording a > > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > > history ;-) > > I'm a bit lost as to why digging into RISC-V mailing list history is > relevant here...what's relevant is what was ratified in the RVWMO > chapter of the RISC-V spec, and whether the code you're proposing > is the most optimized code that is correct wrt RVWMO. > > Is your claim that the code you're proposing to fix was based on a > pre-RVWMO RISC-V memory model definition, and you're updating it to > be more RVWMO-compliant? > Well, first it's not my code ;-) The thing is that this patch proposed by Guo Ren kinda fixes/reverts a previous commit 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences"). It looks to me that Guo Ren's patch fits the current RISC-V memory model and Linux kernel memory model, but the question is that commit 5ce6c1f3535f was also a fix, so why do we change things back and forth? If I understand correctly, this is also what Palmer and Andrea asked for. My understanding is that commit 5ce6c1f3535f was based on a draft RVWMO that was different than the current one. I'd love to record this difference in the commit log of Guo Ren's patch, so that later on we know why we changed things back and forth. To do so, the confirmation from RVWMO authors is helpful. Hope that I make things more clear ;-) Regards, Boqun > Dan > > > Regards, > > Boqun > > > >> Andrea > >> > >> > > [...] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-23 17:55 ` Boqun Feng @ 2022-06-23 22:15 ` Palmer Dabbelt 2022-06-24 3:34 ` Guo Ren 1 sibling, 0 replies; 24+ messages in thread From: Palmer Dabbelt @ 2022-06-23 22:15 UTC (permalink / raw) To: boqun.feng Cc: Daniel Lustig, parri.andrea, guoren, Arnd Bergmann, mark.rutland, Will Deacon, peterz, linux-arch, linux-kernel, linux-riscv, guoren On Thu, 23 Jun 2022 10:55:11 PDT (-0700), boqun.feng@gmail.com wrote: > On Thu, Jun 23, 2022 at 01:09:23PM -0400, Dan Lustig wrote: >> On 6/22/2022 11:31 PM, Boqun Feng wrote: >> > Hi, >> > >> > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: >> > [...] >> >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") >> >>> is about fixup wrong spinlock/unlock implementation and not relate to >> >>> this patch. >> >> >> >> No. The commit in question is evidence of the fact that the changes >> >> you are presenting here (as an optimization) were buggy/incorrect at >> >> the time in which that commit was worked out. >> >> >> >> >> >>> Actually, sc.w.aqrl is very strong and the same with: >> >>> fence rw, rw >> >>> sc.w >> >>> fence rw,rw >> >>> >> >>> So "which do not give full-ordering with .aqrl" is not writen in >> >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. >> >>> >> >>>> >> >>>>>> describes the issue more specifically, that's when we added these >> >>>>>> fences. There have certainly been complains that these fences are too >> >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have >> >>>>> Yeah, it would reduce the performance on D1 and our next-generation >> >>>>> processor has optimized fence performance a lot. >> >>>> >> >>>> Definately a bummer that the fences make the HW go slow, but I don't >> >>>> really see any other way to go about this. 
If you think these mappings >> >>>> are valid for LKMM and RVWMO then we should figure this out, but trying >> >>>> to drop fences to make HW go faster in ways that violate the memory >> >>>> model is going to lead to insanity. >> >>> Actually, this patch is okay with the ISA spec, and Dan also thought >> >>> it was valid. >> >>> >> >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw >> >> >> >> "Thoughts" on this regard have _changed_. Please compare that quote >> >> with, e.g. >> >> >> >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ >> >> >> >> So here's a suggestion: >> >> >> >> Reviewers of your patches have asked: How come that code we used to >> >> consider as buggy is now considered "an optimization" (correct)? >> >> >> >> Denying the evidence or going around it is not making their job (and >> >> this upstreaming) easier, so why don't you address it? Take time to >> >> review previous works and discussions in this area, understand them, >> >> and integrate such knowledge in future submissions. >> >> >> > >> > I agree with Andrea. >> > >> > And I actually took a look into this, and I think I find some >> > explanation. There are two versions of RISV memory model here: >> > >> > Model 2017: released at Dec 1, 2017 as a draft >> > >> > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ >> > >> > Model 2018: released at May 2, 2018 >> > >> > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ >> > >> > Noted that previous conversation about commit 5ce6c1f3535f happened at >> > March 2018. 
So the timeline is roughly: >> > >> > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 >> > >> > And in the email thread of Model 2018, the commit related to model >> > changes also got mentioned: >> > >> > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a >> > >> > in that commit, we can see the changes related to sc.aqrl are: >> > >> > to have occurred between the LR and a successful SC. The LR/SC >> > sequence can be given acquire semantics by setting the {\em aq} bit on >> > -the SC instruction. The LR/SC sequence can be given release semantics >> > -by setting the {\em rl} bit on the LR instruction. Setting both {\em >> > - aq} and {\em rl} bits on the LR instruction, and setting the {\em >> > - aq} bit on the SC instruction makes the LR/SC sequence sequentially >> > -consistent with respect to other sequentially consistent atomic >> > -operations. >> > +the LR instruction. The LR/SC sequence can be given release semantics >> > +by setting the {\em rl} bit on the SC instruction. Setting the {\em >> > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em >> > + rl} bit on the SC instruction makes the LR/SC sequence sequentially >> > +consistent, meaning that it cannot be reordered with earlier or >> > +later memory operations from the same hart. >> > >> > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered >> > against "earlier or later memory operations from the same hart", and >> > this statement was not in Model 2017. >> > >> > So my understanding of the story is that at some point between March and >> > May 2018, RISV memory model folks decided to add this rule, which does >> > look more consistent with other parts of the model and is useful. 
>> > >> > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered >> > barrier ;-) >> > >> > Now if my understanding is correct, to move forward, it's better that 1) >> > this patch gets resend with the above information (better rewording a >> > bit), and 2) gets an Acked-by from Dan to confirm this is a correct >> > history ;-) >> >> I'm a bit lost as to why digging into RISC-V mailing list history is >> relevant here...what's relevant is what was ratified in the RVWMO >> chapter of the RISC-V spec, and whether the code you're proposing >> is the most optimized code that is correct wrt RVWMO. >> >> Is your claim that the code you're proposing to fix was based on a >> pre-RVWMO RISC-V memory model definition, and you're updating it to >> be more RVWMO-compliant? >> > > Well, first it's not my code ;-) > > The thing is that this patch proposed by Guo Ren kinda fixes/revertes a > previous commit 5ce6c1f3535f ("riscv/atomic: Strengthen implementations > with fences"). It looks to me that Guo Ren's patch fits the current > RISCV memory model and Linux kernel memory model, but the question is > that commit 5ce6c1f3535f was also a fix, so why do we change things > back and forth? If I understand correctly, this is also what Palmer and > Andrea asked for. That's essentially my confusion. I'm not really a formal memory model guy so I can easily get lost in these bits, but when I saw it I remembered having looked at a fix before. I dug that up, saw it was from someone who likely knows a lot more about formal memory models than I do, and thus figured I'd ask everyone to see what's up. IMO if that original fix was made to a pre-ratification version of WMO, and this new version is legal WRT the ratified WMO, then the code change is fine to take on for-next. That said, we should be explicit about why it's legal and why the reasoning in the previous patch is no longer correct, just to make sure everyone can follow along.
> My understanding is that commit 5ce6c1f3535f was based on a draft RVWMO > that was different than the current one. I'd love to record this > difference in the commit log of Guo Ren's patch, so that later on we > know why we changed things back and forth. To do so, the confirmation > from RVWMO authors is helpful. Agreed. IMO that's always good hygiene, but it's extra important when we're dealing with external specifications in a complex field like formal models. > Hope that I make things more clear ;-) > > Regards, > Boqun > >> Dan >> >> > Regards, >> > Boqun >> > >> >> Andrea >> >> >> >> >> > [...] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-23 17:55 ` Boqun Feng 2022-06-23 22:15 ` Palmer Dabbelt @ 2022-06-24 3:34 ` Guo Ren 1 sibling, 0 replies; 24+ messages in thread From: Guo Ren @ 2022-06-24 3:34 UTC (permalink / raw) To: Boqun Feng Cc: Dan Lustig, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Fri, Jun 24, 2022 at 1:55 AM Boqun Feng <boqun.feng@gmail.com> wrote: > > On Thu, Jun 23, 2022 at 01:09:23PM -0400, Dan Lustig wrote: > > On 6/22/2022 11:31 PM, Boqun Feng wrote: > > > Hi, > > > > > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > > > [...] > > >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > >>> is about fixup wrong spinlock/unlock implementation and not relate to > > >>> this patch. > > >> > > >> No. The commit in question is evidence of the fact that the changes > > >> you are presenting here (as an optimization) were buggy/incorrect at > > >> the time in which that commit was worked out. > > >> > > >> > > >>> Actually, sc.w.aqrl is very strong and the same with: > > >>> fence rw, rw > > >>> sc.w > > >>> fence rw,rw > > >>> > > >>> So "which do not give full-ordering with .aqrl" is not writen in > > >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > >>> > > >>>> > > >>>>>> describes the issue more specifically, that's when we added these > > >>>>>> fences. There have certainly been complains that these fences are too > > >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > > >>>>> Yeah, it would reduce the performance on D1 and our next-generation > > >>>>> processor has optimized fence performance a lot. > > >>>> > > >>>> Definately a bummer that the fences make the HW go slow, but I don't > > >>>> really see any other way to go about this. 
If you think these mappings > > >>>> are valid for LKMM and RVWMO then we should figure this out, but trying > > >>>> to drop fences to make HW go faster in ways that violate the memory > > >>>> model is going to lead to insanity. > > >>> Actually, this patch is okay with the ISA spec, and Dan also thought > > >>> it was valid. > > >>> > > >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > > >> > > >> "Thoughts" on this regard have _changed_. Please compare that quote > > >> with, e.g. > > >> > > >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > > >> > > >> So here's a suggestion: > > >> > > >> Reviewers of your patches have asked: How come that code we used to > > >> consider as buggy is now considered "an optimization" (correct)? > > >> > > >> Denying the evidence or going around it is not making their job (and > > >> this upstreaming) easier, so why don't you address it? Take time to > > >> review previous works and discussions in this area, understand them, > > >> and integrate such knowledge in future submissions. > > >> > > > > > > I agree with Andrea. > > > > > > And I actually took a look into this, and I think I find some > > > explanation. There are two versions of RISV memory model here: > > > > > > Model 2017: released at Dec 1, 2017 as a draft > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > > > > > Model 2018: released at May 2, 2018 > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > > > > > Noted that previous conversation about commit 5ce6c1f3535f happened at > > > March 2018. 
So the timeline is roughly: > > > > > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > > > > > And in the email thread of Model 2018, the commit related to model > > > changes also got mentioned: > > > > > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > > > > > in that commit, we can see the changes related to sc.aqrl are: > > > > > > to have occurred between the LR and a successful SC. The LR/SC > > > sequence can be given acquire semantics by setting the {\em aq} bit on > > > -the SC instruction. The LR/SC sequence can be given release semantics > > > -by setting the {\em rl} bit on the LR instruction. Setting both {\em > > > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > > > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > > > -consistent with respect to other sequentially consistent atomic > > > -operations. > > > +the LR instruction. The LR/SC sequence can be given release semantics > > > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > > > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > > > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > > > +consistent, meaning that it cannot be reordered with earlier or > > > +later memory operations from the same hart. > > > > > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > > > against "earlier or later memory operations from the same hart", and > > > this statement was not in Model 2017. > > > > > > So my understanding of the story is that at some point between March and > > > May 2018, RISV memory model folks decided to add this rule, which does > > > look more consistent with other parts of the model and is useful. 
> > > > > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > > > barrier ;-) > > > > > > Now if my understanding is correct, to move forward, it's better that 1) > > > this patch gets resend with the above information (better rewording a > > > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > > > history ;-) > > > > I'm a bit lost as to why digging into RISC-V mailing list history is > > relevant here...what's relevant is what was ratified in the RVWMO > > chapter of the RISC-V spec, and whether the code you're proposing > > is the most optimized code that is correct wrt RVWMO. > > > > Is your claim that the code you're proposing to fix was based on a > > pre-RVWMO RISC-V memory model definition, and you're updating it to > > be more RVWMO-compliant? > > > > Well, first it's not my code ;-) > > The thing is that this patch proposed by Guo Ren kinda fixes/revertes a > previous commit 5ce6c1f3535f ("riscv/atomic: Strengthen implementations > with fences"). It looks to me that Guo Ren's patch fits the current > RISCV memory model and Linux kernel memory model, but the question is > that commit 5ce6c1f3535f was also a fix, so why do we change things > back and forth? If I understand correctly, this is also what Palmer and > Andrea asked for. I think the patch is still a little different from 5ce6c1f3535f:

  before:       lr.w.aqrl + sc.w.aqrl
  5ce6c1f3535f: lr.w + sc.w.rl + fence
  this patch:   lr.w + sc.w.aqrl

> > My understanding is that commit 5ce6c1f3535f was based on a draft RVWMO > that was different than the current one. I'd love to record this > difference in the commit log of Guo Ren's patch, so that later on we > know why we changed things back and forth. To do so, the confirmation > from RVWMO authors is helpful. Thx for the effort on the patch and so much related history information, which makes things clearer for me. The motivation of this patch is to save one fence instruction and let .aqrl give the RCsc synchronization point. 
We also have used the .aqrl in ATOMIC_FETCH_OP: " amo" #asm_op "." #asm_type ".aqrl %1, %2, %0" \ I think using .aqrl would be better and unifying into one style in atomic.h would be clearer. (Some are .aqrl, some are .rl + fence) > > Hope that I make things more clear ;-) > > Regards, > Boqun > > > Dan > > > > > Regards, > > > Boqun > > > > > >> Andrea > > >> > > >> > > > [...] -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
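Written out as LR/SC loops, the three variants Guo Ren compares look roughly as follows (an illustrative sketch; register names and labels are assumptions, not copied from the kernel source):

```asm
# (a) before commit 5ce6c1f3535f:
1:  lr.w.aqrl   t0, (a0)
    sc.w.aqrl   t1, a1, (a0)
    bnez        t1, 1b

# (b) commit 5ce6c1f3535f (fence-based, written against the 2017 draft model):
1:  lr.w        t0, (a0)
    sc.w.rl     t1, a1, (a0)
    bnez        t1, 1b
    fence       rw, rw

# (c) this patch (relying on the ratified RVWMO wording; saves the fence):
1:  lr.w        t0, (a0)
    sc.w.aqrl   t1, a1, (a0)
    bnez        t1, 1b
```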
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-23 17:09 ` Dan Lustig 2022-06-23 17:55 ` Boqun Feng @ 2022-06-25 5:29 ` Guo Ren 2022-07-07 0:03 ` Boqun Feng 1 sibling, 1 reply; 24+ messages in thread From: Guo Ren @ 2022-06-25 5:29 UTC (permalink / raw) To: Dan Lustig Cc: Boqun Feng, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: > > On 6/22/2022 11:31 PM, Boqun Feng wrote: > > Hi, > > > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > > [...] > >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > >>> is about fixup wrong spinlock/unlock implementation and not relate to > >>> this patch. > >> > >> No. The commit in question is evidence of the fact that the changes > >> you are presenting here (as an optimization) were buggy/incorrect at > >> the time in which that commit was worked out. > >> > >> > >>> Actually, sc.w.aqrl is very strong and the same with: > >>> fence rw, rw > >>> sc.w > >>> fence rw,rw > >>> > >>> So "which do not give full-ordering with .aqrl" is not writen in > >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > >>> > >>>> > >>>>>> describes the issue more specifically, that's when we added these > >>>>>> fences. There have certainly been complains that these fences are too > >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > >>>>> Yeah, it would reduce the performance on D1 and our next-generation > >>>>> processor has optimized fence performance a lot. > >>>> > >>>> Definately a bummer that the fences make the HW go slow, but I don't > >>>> really see any other way to go about this. 
If you think these mappings > >>>> are valid for LKMM and RVWMO then we should figure this out, but trying > >>>> to drop fences to make HW go faster in ways that violate the memory > >>>> model is going to lead to insanity. > >>> Actually, this patch is okay with the ISA spec, and Dan also thought > >>> it was valid. > >>> > >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > >> > >> "Thoughts" on this regard have _changed_. Please compare that quote > >> with, e.g. > >> > >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > >> > >> So here's a suggestion: > >> > >> Reviewers of your patches have asked: How come that code we used to > >> consider as buggy is now considered "an optimization" (correct)? > >> > >> Denying the evidence or going around it is not making their job (and > >> this upstreaming) easier, so why don't you address it? Take time to > >> review previous works and discussions in this area, understand them, > >> and integrate such knowledge in future submissions. > >> > > > > I agree with Andrea. > > > > And I actually took a look into this, and I think I find some > > explanation. There are two versions of RISV memory model here: > > > > Model 2017: released at Dec 1, 2017 as a draft > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > > > Model 2018: released at May 2, 2018 > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > > > Noted that previous conversation about commit 5ce6c1f3535f happened at > > March 2018. 
So the timeline is roughly: > > > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > > > And in the email thread of Model 2018, the commit related to model > > changes also got mentioned: > > > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > > > in that commit, we can see the changes related to sc.aqrl are: > > > > to have occurred between the LR and a successful SC. The LR/SC > > sequence can be given acquire semantics by setting the {\em aq} bit on > > -the SC instruction. The LR/SC sequence can be given release semantics > > -by setting the {\em rl} bit on the LR instruction. Setting both {\em > > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > > -consistent with respect to other sequentially consistent atomic > > -operations. > > +the LR instruction. The LR/SC sequence can be given release semantics > > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > > +consistent, meaning that it cannot be reordered with earlier or > > +later memory operations from the same hart. > > > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > > against "earlier or later memory operations from the same hart", and > > this statement was not in Model 2017. > > > > So my understanding of the story is that at some point between March and > > May 2018, RISV memory model folks decided to add this rule, which does > > look more consistent with other parts of the model and is useful. 
> > > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > > barrier ;-) > > > > Now if my understanding is correct, to move forward, it's better that 1) > > this patch gets resend with the above information (better rewording a > > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > > history ;-) > > I'm a bit lost as to why digging into RISC-V mailing list history is > relevant here...what's relevant is what was ratified in the RVWMO > chapter of the RISC-V spec, and whether the code you're proposing > is the most optimized code that is correct wrt RVWMO. > > Is your claim that the code you're proposing to fix was based on a > pre-RVWMO RISC-V memory model definition, and you're updating it to > be more RVWMO-compliant? Could "lr + beq + sc.aqrl" provide a conditional RCsc here with the current spec? I only found the "lr.aq + sc.aqrl" description, which is unconditional RCsc. > > Dan > > > Regards, > > Boqun > > > >> Andrea > >> > >> > > [...] -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
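The conditional sequence Guo Ren asks about has roughly the following shape (an illustrative dec_if_positive-style sketch; whether the sequence still provides RCsc ordering when the branch can skip the SC entirely is exactly the open question):

```asm
# Illustrative atomic_dec_if_positive-style sequence.
1:  lr.w        t0, (a0)        # load-reserved old value
    addiw       t1, t0, -1      # compute old - 1
    bltz        t1, 2f          # bail out without an SC if result < 0
    sc.w.aqrl   t2, t1, (a0)    # conditional store; ordering only if executed
    bnez        t2, 1b          # retry on lost reservation
2:  # t1 holds old - 1; the early-exit path performs no SC at all
```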
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-25 5:29 ` Guo Ren @ 2022-07-07 0:03 ` Boqun Feng 2022-07-13 13:38 ` Dan Lustig 2022-07-13 23:47 ` Guo Ren 0 siblings, 2 replies; 24+ messages in thread From: Boqun Feng @ 2022-07-07 0:03 UTC (permalink / raw) To: Guo Ren Cc: Dan Lustig, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: > On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: > > > > On 6/22/2022 11:31 PM, Boqun Feng wrote: > > > Hi, > > > > > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > > > [...] > > >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > >>> is about fixup wrong spinlock/unlock implementation and not relate to > > >>> this patch. > > >> > > >> No. The commit in question is evidence of the fact that the changes > > >> you are presenting here (as an optimization) were buggy/incorrect at > > >> the time in which that commit was worked out. > > >> > > >> > > >>> Actually, sc.w.aqrl is very strong and the same with: > > >>> fence rw, rw > > >>> sc.w > > >>> fence rw,rw > > >>> > > >>> So "which do not give full-ordering with .aqrl" is not writen in > > >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > >>> > > >>>> > > >>>>>> describes the issue more specifically, that's when we added these > > >>>>>> fences. There have certainly been complains that these fences are too > > >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > > >>>>> Yeah, it would reduce the performance on D1 and our next-generation > > >>>>> processor has optimized fence performance a lot. > > >>>> > > >>>> Definately a bummer that the fences make the HW go slow, but I don't > > >>>> really see any other way to go about this. 
If you think these mappings > > >>>> are valid for LKMM and RVWMO then we should figure this out, but trying > > >>>> to drop fences to make HW go faster in ways that violate the memory > > >>>> model is going to lead to insanity. > > >>> Actually, this patch is okay with the ISA spec, and Dan also thought > > >>> it was valid. > > >>> > > >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > > >> > > >> "Thoughts" on this regard have _changed_. Please compare that quote > > >> with, e.g. > > >> > > >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > > >> > > >> So here's a suggestion: > > >> > > >> Reviewers of your patches have asked: How come that code we used to > > >> consider as buggy is now considered "an optimization" (correct)? > > >> > > >> Denying the evidence or going around it is not making their job (and > > >> this upstreaming) easier, so why don't you address it? Take time to > > >> review previous works and discussions in this area, understand them, > > >> and integrate such knowledge in future submissions. > > >> > > > > > > I agree with Andrea. > > > > > > And I actually took a look into this, and I think I find some > > > explanation. There are two versions of RISV memory model here: > > > > > > Model 2017: released at Dec 1, 2017 as a draft > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > > > > > Model 2018: released at May 2, 2018 > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > > > > > Noted that previous conversation about commit 5ce6c1f3535f happened at > > > March 2018. 
So the timeline is roughly: > > > > > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > > > > > And in the email thread of Model 2018, the commit related to model > > > changes also got mentioned: > > > > > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > > > > > in that commit, we can see the changes related to sc.aqrl are: > > > > > > to have occurred between the LR and a successful SC. The LR/SC > > > sequence can be given acquire semantics by setting the {\em aq} bit on > > > -the SC instruction. The LR/SC sequence can be given release semantics > > > -by setting the {\em rl} bit on the LR instruction. Setting both {\em > > > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > > > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > > > -consistent with respect to other sequentially consistent atomic > > > -operations. > > > +the LR instruction. The LR/SC sequence can be given release semantics > > > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > > > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > > > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > > > +consistent, meaning that it cannot be reordered with earlier or > > > +later memory operations from the same hart. > > > > > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > > > against "earlier or later memory operations from the same hart", and > > > this statement was not in Model 2017. > > > > > > So my understanding of the story is that at some point between March and > > > May 2018, RISV memory model folks decided to add this rule, which does > > > look more consistent with other parts of the model and is useful. 
> > > > > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > > > barrier ;-) > > > > > > Now if my understanding is correct, to move forward, it's better that 1) > > > this patch gets resend with the above information (better rewording a > > > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > > > history ;-) > > > > I'm a bit lost as to why digging into RISC-V mailing list history is > > relevant here...what's relevant is what was ratified in the RVWMO > > chapter of the RISC-V spec, and whether the code you're proposing > > is the most optimized code that is correct wrt RVWMO. > > > > Is your claim that the code you're proposing to fix was based on a > > pre-RVWMO RISC-V memory model definition, and you're updating it to > > be more RVWMO-compliant? > Could "lr + beq + sc.aqrl" provides a conditional RCsc here with > current spec? I only found "lr.aq + sc.aqrl" despcriton which is > un-conditional RCsc. > /me put the temporary RISC-V memory model hat on and pretend to be a RISC-V memory expert. I think the answer is yes, it's actually quite straightforward given that RISC-V treats PPO (Preserved Program Order) as part of GMO (Global Memory Order), considering the following (A and B are memory accesses): A .. sc.aqrl // M .. B , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has a ->ppo ordering to B since "sc.aqrl" is an ACQUIRE, so A ->ppo M ->ppo B And since RISC-V describes that PPO is part of GMO: """ The subset of program order that must be respected by the global memory order is known as preserved program order. """ also in the herd model: (* Main model axiom *) acyclic co | rfe | fr | ppo as Model , therefore the ordering between A and B is GMO and GMO should be respected by all harts. Regards, Boqun > > > > Dan > > > > > Regards, > > > Boqun > > > > > >> Andrea > > >> > > >> > > > [...]
> > > > -- > Best Regards > Guo Ren > > ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
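Boqun's A/M/B argument above can be sketched in RISC-V assembly (hypothetical registers and addresses, for illustration only):

```
# One hart; A and B are arbitrary accesses around the sc.w.aqrl (M).
    sw        t0, 0(a0)       # A: some earlier memory access
    lr.w      t1, (a1)
    sc.w.aqrl t2, t1, (a1)    # M: .rl orders A before M (release);
                              #    .aq orders M before B (acquire)
    lw        t3, 0(a2)       # B: some later memory access
# Hence A ->ppo M ->ppo B, and because preserved program order must be
# respected by the global memory order, A precedes B for all harts.
```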
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-07-07 0:03 ` Boqun Feng @ 2022-07-13 13:38 ` Dan Lustig 2022-07-13 23:34 ` Guo Ren 2022-07-13 23:47 ` Guo Ren 1 sibling, 1 reply; 24+ messages in thread From: Dan Lustig @ 2022-07-13 13:38 UTC (permalink / raw) To: Boqun Feng, Guo Ren Cc: Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On 7/6/2022 8:03 PM, Boqun Feng wrote: > On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: >> On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: >>> >>> On 6/22/2022 11:31 PM, Boqun Feng wrote: >>>> Hi, >>>> >>>> On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: >>>> [...] >>>>>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") >>>>>> is about fixup wrong spinlock/unlock implementation and not relate to >>>>>> this patch. >>>>> >>>>> No. The commit in question is evidence of the fact that the changes >>>>> you are presenting here (as an optimization) were buggy/incorrect at >>>>> the time in which that commit was worked out. >>>>> >>>>> >>>>>> Actually, sc.w.aqrl is very strong and the same with: >>>>>> fence rw, rw >>>>>> sc.w >>>>>> fence rw,rw >>>>>> >>>>>> So "which do not give full-ordering with .aqrl" is not writen in >>>>>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. >>>>>> >>>>>>> >>>>>>>>> describes the issue more specifically, that's when we added these >>>>>>>>> fences. There have certainly been complains that these fences are too >>>>>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have >>>>>>>> Yeah, it would reduce the performance on D1 and our next-generation >>>>>>>> processor has optimized fence performance a lot. >>>>>>> >>>>>>> Definately a bummer that the fences make the HW go slow, but I don't >>>>>>> really see any other way to go about this. 
If you think these mappings >>>>>>> are valid for LKMM and RVWMO then we should figure this out, but trying >>>>>>> to drop fences to make HW go faster in ways that violate the memory >>>>>>> model is going to lead to insanity. >>>>>> Actually, this patch is okay with the ISA spec, and Dan also thought >>>>>> it was valid. >>>>>> >>>>>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw >>>>> >>>>> "Thoughts" on this regard have _changed_. Please compare that quote >>>>> with, e.g. >>>>> >>>>> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ >>>>> >>>>> So here's a suggestion: >>>>> >>>>> Reviewers of your patches have asked: How come that code we used to >>>>> consider as buggy is now considered "an optimization" (correct)? >>>>> >>>>> Denying the evidence or going around it is not making their job (and >>>>> this upstreaming) easier, so why don't you address it? Take time to >>>>> review previous works and discussions in this area, understand them, >>>>> and integrate such knowledge in future submissions. >>>>> >>>> >>>> I agree with Andrea. >>>> >>>> And I actually took a look into this, and I think I find some >>>> explanation. There are two versions of RISV memory model here: >>>> >>>> Model 2017: released at Dec 1, 2017 as a draft >>>> >>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ >>>> >>>> Model 2018: released at May 2, 2018 >>>> >>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ >>>> >>>> Noted that previous conversation about commit 5ce6c1f3535f happened at >>>> March 2018. 
So the timeline is roughly: >>>> >>>> Model 2017 -> commit 5ce6c1f3535f -> Model 2018 >>>> >>>> And in the email thread of Model 2018, the commit related to model >>>> changes also got mentioned: >>>> >>>> https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a >>>> >>>> in that commit, we can see the changes related to sc.aqrl are: >>>> >>>> to have occurred between the LR and a successful SC. The LR/SC >>>> sequence can be given acquire semantics by setting the {\em aq} bit on >>>> -the SC instruction. The LR/SC sequence can be given release semantics >>>> -by setting the {\em rl} bit on the LR instruction. Setting both {\em >>>> - aq} and {\em rl} bits on the LR instruction, and setting the {\em >>>> - aq} bit on the SC instruction makes the LR/SC sequence sequentially >>>> -consistent with respect to other sequentially consistent atomic >>>> -operations. >>>> +the LR instruction. The LR/SC sequence can be given release semantics >>>> +by setting the {\em rl} bit on the SC instruction. Setting the {\em >>>> + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em >>>> + rl} bit on the SC instruction makes the LR/SC sequence sequentially >>>> +consistent, meaning that it cannot be reordered with earlier or >>>> +later memory operations from the same hart. >>>> >>>> note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered >>>> against "earlier or later memory operations from the same hart", and >>>> this statement was not in Model 2017. >>>> >>>> So my understanding of the story is that at some point between March and >>>> May 2018, RISV memory model folks decided to add this rule, which does >>>> look more consistent with other parts of the model and is useful. 
>>>> >>>> And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered >>>> barrier ;-) >>>> >>>> Now if my understanding is correct, to move forward, it's better that 1) >>>> this patch gets resend with the above information (better rewording a >>>> bit), and 2) gets an Acked-by from Dan to confirm this is a correct >>>> history ;-) >>> >>> I'm a bit lost as to why digging into RISC-V mailing list history is >>> relevant here...what's relevant is what was ratified in the RVWMO >>> chapter of the RISC-V spec, and whether the code you're proposing >>> is the most optimized code that is correct wrt RVWMO. >>> >>> Is your claim that the code you're proposing to fix was based on a >>> pre-RVWMO RISC-V memory model definition, and you're updating it to >>> be more RVWMO-compliant? >> Could "lr + beq + sc.aqrl" provides a conditional RCsc here with >> current spec? I only found "lr.aq + sc.aqrl" despcriton which is >> un-conditional RCsc. >> > > /me put the temporary RISCV memory model hat on and pretend to be a > RISCV memory expert. > > I think the answer is yes, it's actually quite straightforwards given > that RISCV treats PPO (Preserved Program Order) as part of GMO (Global > Memory Order), considering the following (A and B are memory accesses): > > A > .. > sc.aqrl // M > .. > B > > , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has > a ->ppo ordeing to B since "sc.aqrl" is an AQUIRE, so > > A ->ppo M ->ppo B > > And since RISCV describes that PPO is part of GMO: > > """ > The subset of program order that must be respected by the global memory > order is known as preserved program order. > """ > > also in the herd model: > > (* Main model axiom *) > acyclic co | rfe | fr | ppo as Model > > , therefore the ordering between A and B is GMO and GMO should be > respected by all harts. > > Regards, > Boqun I agree with Boqun's reasoning, at least for the case where there is no branch. 
But to confirm, was the original question about also having a branch, I assume to the instruction immediately after the sc? If so, then yes, that would make the .aqrl effect conditional. Dan > >>> >>> Dan >>> >>>> Regards, >>>> Boqun >>>> >>>>> Andrea >>>>> >>>>> >>>> [...] >> >> >> >> -- >> Best Regards >> Guo Ren >> >> ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
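The branch Dan refers to is the one in the usual LR/SC compare-and-swap shape; a sketch with hypothetical registers, not the actual patch code:

```
# cmpxchg-style LR/SC loop: the compare branch exits past the sc,
# so the .aqrl annotation is only executed when the sc is reached.
0:  lr.w      t0, (a0)        # load-reserved, no .aq annotation
    bne       t0, a1, 1f      # old value mismatch: skip the sc
    sc.w.aqrl t1, a2, (a0)    # .aqrl applies only on this path
    bnez      t1, 0b          # sc failed: retry
1:
# On the early-exit path no .aq/.rl annotation executes at all, which
# is why the RCsc ordering is conditional on the sc being performed.
```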
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-07-13 13:38 ` Dan Lustig @ 2022-07-13 23:34 ` Guo Ren 0 siblings, 0 replies; 24+ messages in thread From: Guo Ren @ 2022-07-13 23:34 UTC (permalink / raw) To: Dan Lustig Cc: Boqun Feng, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Wed, Jul 13, 2022 at 9:38 PM Dan Lustig <dlustig@nvidia.com> wrote: > > On 7/6/2022 8:03 PM, Boqun Feng wrote: > > On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: > >> On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: > >>> > >>> On 6/22/2022 11:31 PM, Boqun Feng wrote: > >>>> Hi, > >>>> > >>>> On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > >>>> [...] > >>>>>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > >>>>>> is about fixup wrong spinlock/unlock implementation and not relate to > >>>>>> this patch. > >>>>> > >>>>> No. The commit in question is evidence of the fact that the changes > >>>>> you are presenting here (as an optimization) were buggy/incorrect at > >>>>> the time in which that commit was worked out. > >>>>> > >>>>> > >>>>>> Actually, sc.w.aqrl is very strong and the same with: > >>>>>> fence rw, rw > >>>>>> sc.w > >>>>>> fence rw,rw > >>>>>> > >>>>>> So "which do not give full-ordering with .aqrl" is not writen in > >>>>>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > >>>>>> > >>>>>>> > >>>>>>>>> describes the issue more specifically, that's when we added these > >>>>>>>>> fences. There have certainly been complains that these fences are too > >>>>>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > >>>>>>>> Yeah, it would reduce the performance on D1 and our next-generation > >>>>>>>> processor has optimized fence performance a lot. 
> >>>>>>> > >>>>>>> Definately a bummer that the fences make the HW go slow, but I don't > >>>>>>> really see any other way to go about this. If you think these mappings > >>>>>>> are valid for LKMM and RVWMO then we should figure this out, but trying > >>>>>>> to drop fences to make HW go faster in ways that violate the memory > >>>>>>> model is going to lead to insanity. > >>>>>> Actually, this patch is okay with the ISA spec, and Dan also thought > >>>>>> it was valid. > >>>>>> > >>>>>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > >>>>> > >>>>> "Thoughts" on this regard have _changed_. Please compare that quote > >>>>> with, e.g. > >>>>> > >>>>> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > >>>>> > >>>>> So here's a suggestion: > >>>>> > >>>>> Reviewers of your patches have asked: How come that code we used to > >>>>> consider as buggy is now considered "an optimization" (correct)? > >>>>> > >>>>> Denying the evidence or going around it is not making their job (and > >>>>> this upstreaming) easier, so why don't you address it? Take time to > >>>>> review previous works and discussions in this area, understand them, > >>>>> and integrate such knowledge in future submissions. > >>>>> > >>>> > >>>> I agree with Andrea. > >>>> > >>>> And I actually took a look into this, and I think I find some > >>>> explanation. There are two versions of RISV memory model here: > >>>> > >>>> Model 2017: released at Dec 1, 2017 as a draft > >>>> > >>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > >>>> > >>>> Model 2018: released at May 2, 2018 > >>>> > >>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > >>>> > >>>> Noted that previous conversation about commit 5ce6c1f3535f happened at > >>>> March 2018. 
So the timeline is roughly: > >>>> > >>>> Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > >>>> > >>>> And in the email thread of Model 2018, the commit related to model > >>>> changes also got mentioned: > >>>> > >>>> https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > >>>> > >>>> in that commit, we can see the changes related to sc.aqrl are: > >>>> > >>>> to have occurred between the LR and a successful SC. The LR/SC > >>>> sequence can be given acquire semantics by setting the {\em aq} bit on > >>>> -the SC instruction. The LR/SC sequence can be given release semantics > >>>> -by setting the {\em rl} bit on the LR instruction. Setting both {\em > >>>> - aq} and {\em rl} bits on the LR instruction, and setting the {\em > >>>> - aq} bit on the SC instruction makes the LR/SC sequence sequentially > >>>> -consistent with respect to other sequentially consistent atomic > >>>> -operations. > >>>> +the LR instruction. The LR/SC sequence can be given release semantics > >>>> +by setting the {\em rl} bit on the SC instruction. Setting the {\em > >>>> + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > >>>> + rl} bit on the SC instruction makes the LR/SC sequence sequentially > >>>> +consistent, meaning that it cannot be reordered with earlier or > >>>> +later memory operations from the same hart. > >>>> > >>>> note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > >>>> against "earlier or later memory operations from the same hart", and > >>>> this statement was not in Model 2017. > >>>> > >>>> So my understanding of the story is that at some point between March and > >>>> May 2018, RISV memory model folks decided to add this rule, which does > >>>> look more consistent with other parts of the model and is useful. 
> >>>> > >>>> And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > >>>> barrier ;-) > >>>> > >>>> Now if my understanding is correct, to move forward, it's better that 1) > >>>> this patch gets resend with the above information (better rewording a > >>>> bit), and 2) gets an Acked-by from Dan to confirm this is a correct > >>>> history ;-) > >>> > >>> I'm a bit lost as to why digging into RISC-V mailing list history is > >>> relevant here...what's relevant is what was ratified in the RVWMO > >>> chapter of the RISC-V spec, and whether the code you're proposing > >>> is the most optimized code that is correct wrt RVWMO. > >>> > >>> Is your claim that the code you're proposing to fix was based on a > >>> pre-RVWMO RISC-V memory model definition, and you're updating it to > >>> be more RVWMO-compliant? > >> Could "lr + beq + sc.aqrl" provides a conditional RCsc here with > >> current spec? I only found "lr.aq + sc.aqrl" despcriton which is > >> un-conditional RCsc. > >> > > > > /me put the temporary RISCV memory model hat on and pretend to be a > > RISCV memory expert. > > > > I think the answer is yes, it's actually quite straightforwards given > > that RISCV treats PPO (Preserved Program Order) as part of GMO (Global > > Memory Order), considering the following (A and B are memory accesses): > > > > A > > .. > > sc.aqrl // M > > .. > > B > > > > , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has > > a ->ppo ordeing to B since "sc.aqrl" is an AQUIRE, so > > > > A ->ppo M ->ppo B > > > > And since RISCV describes that PPO is part of GMO: > > > > """ > > The subset of program order that must be respected by the global memory > > order is known as preserved program order. > > """ > > > > also in the herd model: > > > > (* Main model axiom *) > > acyclic co | rfe | fr | ppo as Model > > > > , therefore the ordering between A and B is GMO and GMO should be > > respected by all harts. 
> > > > Regards, > > Boqun > > I agree with Boqun's reasoning, at least for the case where there > is no branch. > > But to confirm, was the original question about also having a branch, > I assume to the instruction immediately after the sc? If so, then > yes, that would make the .aqrl effect conditional. >> Could "lr + beq + sc.aqrl" provides a conditional RCsc here with current spec? >> I only found "lr.aq + sc.aqrl" despcriton which is un-conditional RCsc. In the ISA spec, I found: "Setting the aq bit on the LR instruction, and setting both the aq and the rl bit on the SC instruction makes the LR/SC sequence sequentially consistent, meaning that it cannot be reordered with earlier or later memory operations from the same hart." There is no example description of "lr + bnez + sc.aqrl" or "lr.aq + bnez + sc.aqrl". >> Could "lr + beq + sc.aqrl" provides a conditional RCsc here with current spec? So, is the above legal for RVWMO & LKMM? > > Dan > > > >>> > >>> Dan > >>> > >>>> Regards, > >>>> Boqun > >>>> > >>>>> Andrea > >>>>> > >>>>> > >>>> [...] > >> > >> > >> > >> -- > >> Best Regards > >> Guo Ren > >> > >> ML: https://lore.kernel.org/linux-csky/ -- Best Regards Guo Ren ^ permalink raw reply [flat|nested] 24+ messages in thread
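For contrast, the only LR/SC sequence the quoted ISA-manual sentence explicitly calls sequentially consistent is the branch-free annotated pair (sketch, hypothetical registers):

```
# Sequence the manual describes as one that "cannot be reordered with
# earlier or later memory operations from the same hart":
    lr.w.aq   t0, (a0)
    sc.w.aqrl t1, t2, (a0)
# The manual text says nothing explicit about a beq/bnez placed
# between the lr and the sc, which is the shape cmpxchg uses.
```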
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-07-07 0:03 ` Boqun Feng 2022-07-13 13:38 ` Dan Lustig @ 2022-07-13 23:47 ` Guo Ren 2022-07-14 13:06 ` Dan Lustig 1 sibling, 1 reply; 24+ messages in thread From: Guo Ren @ 2022-07-13 23:47 UTC (permalink / raw) To: Boqun Feng Cc: Dan Lustig, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Thu, Jul 7, 2022 at 8:04 AM Boqun Feng <boqun.feng@gmail.com> wrote: > > On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: > > On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: > > > > > > On 6/22/2022 11:31 PM, Boqun Feng wrote: > > > > Hi, > > > > > > > > On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > > > > [...] > > > >>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > > >>> is about fixup wrong spinlock/unlock implementation and not relate to > > > >>> this patch. > > > >> > > > >> No. The commit in question is evidence of the fact that the changes > > > >> you are presenting here (as an optimization) were buggy/incorrect at > > > >> the time in which that commit was worked out. > > > >> > > > >> > > > >>> Actually, sc.w.aqrl is very strong and the same with: > > > >>> fence rw, rw > > > >>> sc.w > > > >>> fence rw,rw > > > >>> > > > >>> So "which do not give full-ordering with .aqrl" is not writen in > > > >>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > > >>> > > > >>>> > > > >>>>>> describes the issue more specifically, that's when we added these > > > >>>>>> fences. There have certainly been complains that these fences are too > > > >>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > > > >>>>> Yeah, it would reduce the performance on D1 and our next-generation > > > >>>>> processor has optimized fence performance a lot. 
> > > >>>> > > > >>>> Definately a bummer that the fences make the HW go slow, but I don't > > > >>>> really see any other way to go about this. If you think these mappings > > > >>>> are valid for LKMM and RVWMO then we should figure this out, but trying > > > >>>> to drop fences to make HW go faster in ways that violate the memory > > > >>>> model is going to lead to insanity. > > > >>> Actually, this patch is okay with the ISA spec, and Dan also thought > > > >>> it was valid. > > > >>> > > > >>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > > > >> > > > >> "Thoughts" on this regard have _changed_. Please compare that quote > > > >> with, e.g. > > > >> > > > >> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > > > >> > > > >> So here's a suggestion: > > > >> > > > >> Reviewers of your patches have asked: How come that code we used to > > > >> consider as buggy is now considered "an optimization" (correct)? > > > >> > > > >> Denying the evidence or going around it is not making their job (and > > > >> this upstreaming) easier, so why don't you address it? Take time to > > > >> review previous works and discussions in this area, understand them, > > > >> and integrate such knowledge in future submissions. > > > >> > > > > > > > > I agree with Andrea. > > > > > > > > And I actually took a look into this, and I think I find some > > > > explanation. There are two versions of RISV memory model here: > > > > > > > > Model 2017: released at Dec 1, 2017 as a draft > > > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > > > > > > > > Model 2018: released at May 2, 2018 > > > > > > > > https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > > > > > > > > Noted that previous conversation about commit 5ce6c1f3535f happened at > > > > March 2018. 
So the timeline is roughly: > > > > > > > > Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > > > > > > > > And in the email thread of Model 2018, the commit related to model > > > > changes also got mentioned: > > > > > > > > https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > > > > > > > > in that commit, we can see the changes related to sc.aqrl are: > > > > > > > > to have occurred between the LR and a successful SC. The LR/SC > > > > sequence can be given acquire semantics by setting the {\em aq} bit on > > > > -the SC instruction. The LR/SC sequence can be given release semantics > > > > -by setting the {\em rl} bit on the LR instruction. Setting both {\em > > > > - aq} and {\em rl} bits on the LR instruction, and setting the {\em > > > > - aq} bit on the SC instruction makes the LR/SC sequence sequentially > > > > -consistent with respect to other sequentially consistent atomic > > > > -operations. > > > > +the LR instruction. The LR/SC sequence can be given release semantics > > > > +by setting the {\em rl} bit on the SC instruction. Setting the {\em > > > > + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > > > > + rl} bit on the SC instruction makes the LR/SC sequence sequentially > > > > +consistent, meaning that it cannot be reordered with earlier or > > > > +later memory operations from the same hart. > > > > > > > > note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > > > > against "earlier or later memory operations from the same hart", and > > > > this statement was not in Model 2017. > > > > > > > > So my understanding of the story is that at some point between March and > > > > May 2018, RISV memory model folks decided to add this rule, which does > > > > look more consistent with other parts of the model and is useful. 
> > > > > > > > And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > > > > barrier ;-) > > > > > > > > Now if my understanding is correct, to move forward, it's better that 1) > > > > this patch gets resend with the above information (better rewording a > > > > bit), and 2) gets an Acked-by from Dan to confirm this is a correct > > > > history ;-) > > > > > > I'm a bit lost as to why digging into RISC-V mailing list history is > > > relevant here...what's relevant is what was ratified in the RVWMO > > > chapter of the RISC-V spec, and whether the code you're proposing > > > is the most optimized code that is correct wrt RVWMO. > > > > > > Is your claim that the code you're proposing to fix was based on a > > > pre-RVWMO RISC-V memory model definition, and you're updating it to > > > be more RVWMO-compliant? > > Could "lr + beq + sc.aqrl" provides a conditional RCsc here with > > current spec? I only found "lr.aq + sc.aqrl" despcriton which is > > un-conditional RCsc. > > > > /me put the temporary RISCV memory model hat on and pretend to be a > RISCV memory expert. > > I think the answer is yes, it's actually quite straightforwards given > that RISCV treats PPO (Preserved Program Order) as part of GMO (Global > Memory Order), considering the following (A and B are memory accesses): > > A > .. > sc.aqrl // M > .. > B > > , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has > a ->ppo ordeing to B since "sc.aqrl" is an AQUIRE, so > > A ->ppo M ->ppo B That also means M must fence.rl + sc + fence.aq. But in the release consistency model, "rl + aq" is not legal and has no guarantee at all. So sc.aqrl should be clarified in spec, but I only found "lr.aq + sc.aqrl" description, see the patch commit log. Could we treat sc.aqrl as a whole in ISA? Because in micro-arch, we must separate it into pieces for implementation. That is what the RVWMO should give out. 
> > And since RISCV describes that PPO is part of GMO: > > """ > The subset of program order that must be respected by the global memory > order is known as preserved program order. > """ > > also in the herd model: > > (* Main model axiom *) > acyclic co | rfe | fr | ppo as Model If the herd7 model has defined that, I think it should be legal. Good catch. > > , therefore the ordering between A and B is GMO and GMO should be > respected by all harts. > > Regards, > Boqun > > > > > > > Dan > > > > > > > Regards, > > > > Boqun > > > > > > > >> Andrea > > > >> > > > >> > > > > [...] > > > > > > > > -- > > Best Regards > > Guo Ren > > > > ML: https://lore.kernel.org/linux-csky/ -- Best Regards Guo Ren ^ permalink raw reply [flat|nested] 24+ messages in thread
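The fully-fenced mapping that sc.w.aqrl is claimed, earlier in the thread, to be at least as strong as would look like this (sketch, hypothetical registers):

```
# Conservative fence-based mapping discussed earlier in the thread:
    fence rw, rw              # order all earlier accesses before the sc
    sc.w  t1, t2, (a0)
    fence rw, rw              # order the sc before all later accesses
# Whether the .aq/.rl annotations on a single sc may be reasoned about
# "in pieces" like this is exactly the clarification being requested.
```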
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-07-13 23:47 ` Guo Ren @ 2022-07-14 13:06 ` Dan Lustig 2022-08-09 7:06 ` Guo Ren 0 siblings, 1 reply; 24+ messages in thread From: Dan Lustig @ 2022-07-14 13:06 UTC (permalink / raw) To: Guo Ren, Boqun Feng Cc: Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On 7/13/2022 7:47 PM, Guo Ren wrote: > On Thu, Jul 7, 2022 at 8:04 AM Boqun Feng <boqun.feng@gmail.com> wrote: >> >> On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: >>> On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: >>>> >>>> On 6/22/2022 11:31 PM, Boqun Feng wrote: >>>>> Hi, >>>>> >>>>> On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: >>>>> [...] >>>>>>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") >>>>>>> is about fixup wrong spinlock/unlock implementation and not relate to >>>>>>> this patch. >>>>>> >>>>>> No. The commit in question is evidence of the fact that the changes >>>>>> you are presenting here (as an optimization) were buggy/incorrect at >>>>>> the time in which that commit was worked out. >>>>>> >>>>>> >>>>>>> Actually, sc.w.aqrl is very strong and the same with: >>>>>>> fence rw, rw >>>>>>> sc.w >>>>>>> fence rw,rw >>>>>>> >>>>>>> So "which do not give full-ordering with .aqrl" is not writen in >>>>>>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. >>>>>>> >>>>>>>> >>>>>>>>>> describes the issue more specifically, that's when we added these >>>>>>>>>> fences. There have certainly been complains that these fences are too >>>>>>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have >>>>>>>>> Yeah, it would reduce the performance on D1 and our next-generation >>>>>>>>> processor has optimized fence performance a lot. 
>>>>>>>> >>>>>>>> Definitely a bummer that the fences make the HW go slow, but I don't >>>>>>>> really see any other way to go about this. If you think these mappings >>>>>>>> are valid for LKMM and RVWMO then we should figure this out, but trying >>>>>>>> to drop fences to make HW go faster in ways that violate the memory >>>>>>>> model is going to lead to insanity. >>>>>>> Actually, this patch is okay with the ISA spec, and Dan also thought >>>>>>> it was valid. >>>>>>> >>>>>>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw >>>>>> >>>>>> "Thoughts" on this regard have _changed_. Please compare that quote >>>>>> with, e.g. >>>>>> >>>>>> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ >>>>>> >>>>>> So here's a suggestion: >>>>>> >>>>>> Reviewers of your patches have asked: How come that code we used to >>>>>> consider as buggy is now considered "an optimization" (correct)? >>>>>> >>>>>> Denying the evidence or going around it is not making their job (and >>>>>> this upstreaming) easier, so why don't you address it? Take time to >>>>>> review previous works and discussions in this area, understand them, >>>>>> and integrate such knowledge in future submissions. >>>>>> >>>>> >>>>> I agree with Andrea. >>>>> >>>>> And I actually took a look into this, and I think I found some >>>>> explanation. There are two versions of the RISC-V memory model here: >>>>> >>>>> Model 2017: released on Dec 1, 2017 as a draft >>>>> >>>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ >>>>> >>>>> Model 2018: released on May 2, 2018 >>>>> >>>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ >>>>> >>>>> Note that the previous conversation about commit 5ce6c1f3535f happened in >>>>> March 2018. 
So the timeline is roughly: >>>>> >>>>> Model 2017 -> commit 5ce6c1f3535f -> Model 2018 >>>>> >>>>> And in the email thread of Model 2018, the commit related to the model >>>>> changes also got mentioned: >>>>> >>>>> https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a >>>>> >>>>> In that commit, we can see the changes related to sc.aqrl are: >>>>> >>>>> to have occurred between the LR and a successful SC. The LR/SC >>>>> sequence can be given acquire semantics by setting the {\em aq} bit on >>>>> -the SC instruction. The LR/SC sequence can be given release semantics >>>>> -by setting the {\em rl} bit on the LR instruction. Setting both {\em >>>>> - aq} and {\em rl} bits on the LR instruction, and setting the {\em >>>>> - aq} bit on the SC instruction makes the LR/SC sequence sequentially >>>>> -consistent with respect to other sequentially consistent atomic >>>>> -operations. >>>>> +the LR instruction. The LR/SC sequence can be given release semantics >>>>> +by setting the {\em rl} bit on the SC instruction. Setting the {\em >>>>> + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em >>>>> + rl} bit on the SC instruction makes the LR/SC sequence sequentially >>>>> +consistent, meaning that it cannot be reordered with earlier or >>>>> +later memory operations from the same hart. >>>>> >>>>> Note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered >>>>> against "earlier or later memory operations from the same hart", and >>>>> this statement was not in Model 2017. >>>>> >>>>> So my understanding of the story is that at some point between March and >>>>> May 2018, the RISC-V memory model folks decided to add this rule, which does >>>>> look more consistent with other parts of the model and is useful. 
>>>>> >>>>> And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered >>>>> barrier ;-) >>>>> >>>>> Now if my understanding is correct, to move forward, it's better that 1) >>>>> this patch gets resent with the above information (better reworded a >>>>> bit), and 2) gets an Acked-by from Dan to confirm this is a correct >>>>> history ;-) >>>> >>>> I'm a bit lost as to why digging into RISC-V mailing list history is >>>> relevant here...what's relevant is what was ratified in the RVWMO >>>> chapter of the RISC-V spec, and whether the code you're proposing >>>> is the most optimized code that is correct wrt RVWMO. >>>> >>>> Is your claim that the code you're proposing to fix was based on a >>>> pre-RVWMO RISC-V memory model definition, and you're updating it to >>>> be more RVWMO-compliant? >>> Could "lr + beq + sc.aqrl" provide a conditional RCsc here with the >>> current spec? I only found the "lr.aq + sc.aqrl" description, which is >>> unconditional RCsc. >>> >> >> /me puts the temporary RISC-V memory model hat on and pretends to be a >> RISC-V memory expert. >> >> I think the answer is yes, it's actually quite straightforward given >> that RISC-V treats PPO (Preserved Program Order) as part of GMO (Global >> Memory Order), considering the following (A and B are memory accesses): >> >> A >> .. >> sc.aqrl // M >> .. >> B >> >> , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has >> a ->ppo ordering to B since "sc.aqrl" is an ACQUIRE, so >> >> A ->ppo M ->ppo B > That also means M must be fence.rl + sc + fence.aq. But in the release > consistency model, "rl + aq" is not legal and has no guarantee at all. > > So sc.aqrl should be clarified in the spec, but I only found the "lr.aq + > sc.aqrl" description, see the patch commit log. The spec doesn't try to enumerate every possible synchronization instruction sequence. That's why the RVWMO rules are in place. > Could we treat sc.aqrl as a whole in the ISA? 
Because in micro-arch, we > must separate it into pieces for implementation. > > That is what the RVWMO should give out. What exactly would you like the spec to say about this? RVWMO and the RISC-V spec in general describe the required architecturally observable behavior. They're not implementation guides. Generally speaking, I would expect splitting an sc.aqrl into a ".rl; sc; .aq" pattern to be OK. That wouldn't introduce new observable behaviors compared to treating the sc.aqrl as a single indivisible operation, would it? Dan >> And since RISCV describes that PPO is part of GMO: >> >> """ >> The subset of program order that must be respected by the global memory >> order is known as preserved program order. >> """ >> >> also in the herd model: >> >> (* Main model axiom *) >> acyclic co | rfe | fr | ppo as Model > If the herd7 model has defined that, I think it should be legal. Good catch. > > >> >> , therefore the ordering between A and B is GMO and GMO should be >> respected by all harts. >> >> Regards, >> Boqun >> >>>> >>>> Dan >>>> >>>>> Regards, >>>>> Boqun >>>>> >>>>>> Andrea >>>>>> >>>>>> >>>>> [...] >>> >>> >>> >>> -- >>> Best Regards >>> Guo Ren >>> >>> ML: https://lore.kernel.org/linux-csky/ > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
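For readers more at home in C11 atomics than in RISC-V assembly, the "fully ordered RMW" property Dan describes has a rough portable analogy: a compare-and-swap retry loop whose successful exchange uses `memory_order_seq_cst`. The sketch below is the editor's illustration, not kernel code; the helper names (`cas_add_seq_cst`, `demo`) are invented, and the seq_cst success ordering stands in for the `.aqrl` annotation on the SC.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative analogy (not kernel code): a CAS retry loop standing in for
 * an LR/SC sequence.  memory_order_seq_cst on the successful exchange plays
 * the role of .aqrl: the RMW cannot be reordered with earlier or later
 * memory operations from the same thread ("hart"). */
static int cas_add_seq_cst(atomic_int *v, int delta)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* A failed exchange only needs relaxed ordering, mirroring the point
     * that the ordering guarantee attaches to the *successful* SC. */
    while (!atomic_compare_exchange_weak_explicit(v, &old, old + delta,
                                                  memory_order_seq_cst,
                                                  memory_order_relaxed))
        ;
    return old;  /* previous value, like the kernel's fetch_* ops */
}

static int demo(void)
{
    atomic_int v = 5;
    int prev = cas_add_seq_cst(&v, 3);  /* prev = 5, v becomes 8 */

    return prev * 100 + atomic_load(&v);
}
```

As with the SC, only the successful exchange carries the full ordering; failed attempts stay relaxed.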
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-07-14 13:06 ` Dan Lustig @ 2022-08-09 7:06 ` Guo Ren 0 siblings, 0 replies; 24+ messages in thread From: Guo Ren @ 2022-08-09 7:06 UTC (permalink / raw) To: Dan Lustig Cc: Boqun Feng, Andrea Parri, Palmer Dabbelt, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Thu, Jul 14, 2022 at 9:06 PM Dan Lustig <dlustig@nvidia.com> wrote: > > On 7/13/2022 7:47 PM, Guo Ren wrote: > > On Thu, Jul 7, 2022 at 8:04 AM Boqun Feng <boqun.feng@gmail.com> wrote: > >> > >> On Sat, Jun 25, 2022 at 01:29:50PM +0800, Guo Ren wrote: > >>> On Fri, Jun 24, 2022 at 1:09 AM Dan Lustig <dlustig@nvidia.com> wrote: > >>>> > >>>> On 6/22/2022 11:31 PM, Boqun Feng wrote: > >>>>> Hi, > >>>>> > >>>>> On Tue, Jun 14, 2022 at 01:03:47PM +0200, Andrea Parri wrote: > >>>>> [...] > >>>>>>> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > >>>>>>> is about fixup wrong spinlock/unlock implementation and not relate to > >>>>>>> this patch. > >>>>>> > >>>>>> No. The commit in question is evidence of the fact that the changes > >>>>>> you are presenting here (as an optimization) were buggy/incorrect at > >>>>>> the time in which that commit was worked out. > >>>>>> > >>>>>> > >>>>>>> Actually, sc.w.aqrl is very strong and the same with: > >>>>>>> fence rw, rw > >>>>>>> sc.w > >>>>>>> fence rw,rw > >>>>>>> > >>>>>>> So "which do not give full-ordering with .aqrl" is not writen in > >>>>>>> RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > >>>>>>> > >>>>>>>> > >>>>>>>>>> describes the issue more specifically, that's when we added these > >>>>>>>>>> fences. 
There have certainly been complains that these fences are too > >>>>>>>>>> heavyweight for the HW to go fast, but IIUC it's the best option we have > >>>>>>>>> Yeah, it would reduce the performance on D1 and our next-generation > >>>>>>>>> processor has optimized fence performance a lot. > >>>>>>>> > >>>>>>>> Definately a bummer that the fences make the HW go slow, but I don't > >>>>>>>> really see any other way to go about this. If you think these mappings > >>>>>>>> are valid for LKMM and RVWMO then we should figure this out, but trying > >>>>>>>> to drop fences to make HW go faster in ways that violate the memory > >>>>>>>> model is going to lead to insanity. > >>>>>>> Actually, this patch is okay with the ISA spec, and Dan also thought > >>>>>>> it was valid. > >>>>>>> > >>>>>>> ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > >>>>>> > >>>>>> "Thoughts" on this regard have _changed_. Please compare that quote > >>>>>> with, e.g. > >>>>>> > >>>>>> https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > >>>>>> > >>>>>> So here's a suggestion: > >>>>>> > >>>>>> Reviewers of your patches have asked: How come that code we used to > >>>>>> consider as buggy is now considered "an optimization" (correct)? > >>>>>> > >>>>>> Denying the evidence or going around it is not making their job (and > >>>>>> this upstreaming) easier, so why don't you address it? Take time to > >>>>>> review previous works and discussions in this area, understand them, > >>>>>> and integrate such knowledge in future submissions. > >>>>>> > >>>>> > >>>>> I agree with Andrea. > >>>>> > >>>>> And I actually took a look into this, and I think I find some > >>>>> explanation. 
There are two versions of RISV memory model here: > >>>>> > >>>>> Model 2017: released at Dec 1, 2017 as a draft > >>>>> > >>>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/hKywNHBkAXM/m/QzUtxEWLBQAJ > >>>>> > >>>>> Model 2018: released at May 2, 2018 > >>>>> > >>>>> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/xW03vmfmPuA/m/bMPk3UCWAgAJ > >>>>> > >>>>> Noted that previous conversation about commit 5ce6c1f3535f happened at > >>>>> March 2018. So the timeline is roughly: > >>>>> > >>>>> Model 2017 -> commit 5ce6c1f3535f -> Model 2018 > >>>>> > >>>>> And in the email thread of Model 2018, the commit related to model > >>>>> changes also got mentioned: > >>>>> > >>>>> https://github.com/riscv/riscv-isa-manual/commit/b875fe417948635ed68b9644ffdf718cb343a81a > >>>>> > >>>>> in that commit, we can see the changes related to sc.aqrl are: > >>>>> > >>>>> to have occurred between the LR and a successful SC. The LR/SC > >>>>> sequence can be given acquire semantics by setting the {\em aq} bit on > >>>>> -the SC instruction. The LR/SC sequence can be given release semantics > >>>>> -by setting the {\em rl} bit on the LR instruction. Setting both {\em > >>>>> - aq} and {\em rl} bits on the LR instruction, and setting the {\em > >>>>> - aq} bit on the SC instruction makes the LR/SC sequence sequentially > >>>>> -consistent with respect to other sequentially consistent atomic > >>>>> -operations. > >>>>> +the LR instruction. The LR/SC sequence can be given release semantics > >>>>> +by setting the {\em rl} bit on the SC instruction. Setting the {\em > >>>>> + aq} bit on the LR instruction, and setting both the {\em aq} and the {\em > >>>>> + rl} bit on the SC instruction makes the LR/SC sequence sequentially > >>>>> +consistent, meaning that it cannot be reordered with earlier or > >>>>> +later memory operations from the same hart. 
> >>>>> > >>>>> note that Model 2018 explicitly says that "ld.aq+sc.aqrl" is ordered > >>>>> against "earlier or later memory operations from the same hart", and > >>>>> this statement was not in Model 2017. > >>>>> > >>>>> So my understanding of the story is that at some point between March and > >>>>> May 2018, RISV memory model folks decided to add this rule, which does > >>>>> look more consistent with other parts of the model and is useful. > >>>>> > >>>>> And this is why (and when) "ld.aq+sc.aqrl" can be used as a fully-ordered > >>>>> barrier ;-) > >>>>> > >>>>> Now if my understanding is correct, to move forward, it's better that 1) > >>>>> this patch gets resend with the above information (better rewording a > >>>>> bit), and 2) gets an Acked-by from Dan to confirm this is a correct > >>>>> history ;-) > >>>> > >>>> I'm a bit lost as to why digging into RISC-V mailing list history is > >>>> relevant here...what's relevant is what was ratified in the RVWMO > >>>> chapter of the RISC-V spec, and whether the code you're proposing > >>>> is the most optimized code that is correct wrt RVWMO. > >>>> > >>>> Is your claim that the code you're proposing to fix was based on a > >>>> pre-RVWMO RISC-V memory model definition, and you're updating it to > >>>> be more RVWMO-compliant? > >>> Could "lr + beq + sc.aqrl" provides a conditional RCsc here with > >>> current spec? I only found "lr.aq + sc.aqrl" despcriton which is > >>> un-conditional RCsc. > >>> > >> > >> /me put the temporary RISCV memory model hat on and pretend to be a > >> RISCV memory expert. > >> > >> I think the answer is yes, it's actually quite straightforwards given > >> that RISCV treats PPO (Preserved Program Order) as part of GMO (Global > >> Memory Order), considering the following (A and B are memory accesses): > >> > >> A > >> .. > >> sc.aqrl // M > >> .. 
> >> B > >> > >> , A has a ->ppo ordering to M since "sc.aqrl" is a RELEASE, and M has > >> a ->ppo ordering to B since "sc.aqrl" is an ACQUIRE, so > >> > >> A ->ppo M ->ppo B > > That also means M must be fence.rl + sc + fence.aq. But in the release > > consistency model, "rl + aq" is not legal and has no guarantee at all. > > > > So sc.aqrl should be clarified in the spec, but I only found the "lr.aq + > > sc.aqrl" description, see the patch commit log. > > The spec doesn't try to enumerate every possible synchronization > instruction sequence. That's why the RVWMO rules are in place. Okay, I just want to confirm that "lr + sc.aqrl" is correct here. > > > Could we treat sc.aqrl as a whole in the ISA? Because in micro-arch, we > > must separate it into pieces for implementation. > > > > That is what the RVWMO should give out. > > What exactly would you like the spec to say about this? RVWMO and the > RISC-V spec in general describe the required architecturally observable > behavior. They're not implementation guides. > > Generally speaking, I would expect splitting an sc.aqrl into a > ".rl; sc; .aq" pattern to be OK. That wouldn't introduce new observable > behaviors compared to treating the sc.aqrl as a single indivisible > operation, would it? Yes, I think the modification below is correct, and it could improve performance on the fast path. Adding the .aq annotation on the failed-SC retry loop won't cause side effects, right? "0: lr.d %[p], %[c]\n" " beq %[p], %[u], 1f\n" " add %[rc], %[p], %[a]\n" - " sc.d.rl %[rc], %[rc], %[c]\n" + " sc.d.aqrl %[rc], %[rc], %[c]\n" " bnez %[rc], 0b\n" - " fence rw, rw\n" > > Dan > > >> And since RISC-V describes that PPO is part of GMO: > >> """ > >> The subset of program order that must be respected by the global memory > >> order is known as preserved program order. 
> >> """ > >> > >> also in the herd model: > >> > >> (* Main model axiom *) > >> acyclic co | rfe | fr | ppo as Model > > If the herd7 model has defined that, I think it should be legal. Good catch. > > > > > >> > >> , therefore the ordering between A and B is GMO and GMO should be > >> respected by all harts. > >> > >> Regards, > >> Boqun > >> > >>>> > >>>> Dan > >>>> > >>>>> Regards, > >>>>> Boqun > >>>>> > >>>>>> Andrea > >>>>>> > >>>>>> > >>>>> [...] > >>> > >>> > >>> > >>> -- > >>> Best Regards > >>> Guo Ren > >>> > >>> ML: https://lore.kernel.org/linux-csky/ > > > > > > -- Best Regards Guo Ren ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation 2022-06-14 11:03 ` Andrea Parri 2022-06-23 3:31 ` Boqun Feng @ 2022-06-24 3:28 ` Guo Ren 1 sibling, 0 replies; 24+ messages in thread From: Guo Ren @ 2022-06-24 3:28 UTC (permalink / raw) To: Andrea Parri Cc: Palmer Dabbelt, Daniel Lustig, Arnd Bergmann, Mark Rutland, Will Deacon, Peter Zijlstra, Boqun Feng, linux-arch, Linux Kernel Mailing List, linux-riscv, Guo Ren On Tue, Jun 14, 2022 at 7:03 PM Andrea Parri <parri.andrea@gmail.com> wrote: > > Guo, > > On Mon, Jun 13, 2022 at 07:49:50PM +0800, Guo Ren wrote: > > On Thu, Jun 2, 2022 at 1:59 PM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > > > > > On Sun, 22 May 2022 06:12:56 PDT (-0700), guoren@kernel.org wrote: > > > > On Sun, May 22, 2022 at 4:46 AM Palmer Dabbelt <palmer@dabbelt.com> wrote: > > > >> > > > >> On Wed, 04 May 2022 20:55:26 PDT (-0700), guoren@kernel.org wrote: > > > >> > From: Guo Ren <guoren@linux.alibaba.com> > > > >> > > > > >> > The current implementation is the same as 8e86f0b409a4 > > > >> > ("arm64: atomics: fix use of acquire + release for full barrier > > > >> > semantics"). RISC-V could combine acquire and release into the SC > > > >> > instructions and it could reduce a fence instruction to gain better > > > >> > performance. Here is the related description from RISC-V ISA 10.2 > > > >> > Load-Reserved/Store-Conditional Instructions: > > > >> > > > > >> > - .aq: The LR/SC sequence can be given acquire semantics by > > > >> > setting the aq bit on the LR instruction. > > > >> > - .rl: The LR/SC sequence can be given release semantics by > > > >> > setting the rl bit on the SC instruction. > > > >> > - .aqrl: Setting the aq bit on the LR instruction, and setting > > > >> > both the aq and the rl bit on the SC instruction makes > > > >> > the LR/SC sequence sequentially consistent, meaning that > > > >> > it cannot be reordered with earlier or later memory > > > >> > operations from the same hart. 
> > > >> > > > >> > Software should not set the rl bit on an LR instruction unless > > > >> > the aq bit is also set, nor should software set the aq bit on an > > > >> > SC instruction unless the rl bit is also set. LR.rl and SC.aq > > > >> > instructions are not guaranteed to provide any stronger ordering > > > >> > than those with both bits clear, but may result in lower > > > >> > performance. > > > >> > > > > >> > The only difference is that when sc.w/d.aqrl fails, it has an extra > > > >> > .aq effect compared to before. But that's okay semantically, because > > > >> > an overlapping-address LR couldn't be reordered beyond its related SC. > > > >> > > > >> IIUC that's not accurate, or at least wasn't in 2018. The ISA tends to > > > >> drift around a bit, so it's possible things have changed since then. > > > >> 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > > > Thanks for explaining; that's why I was suspicious of the sentence in the comment: > > > > "which do not give full-ordering with .aqrl" > > > > > > Sorry, I'm not quite sure what you're saying here. My understanding is > > > that this change results in mappings that violate LKMM, based on the > > > rationale in that previous commit. IIUC that all still holds, but I'm > > > not really a memory model person so I frequently miss stuff around > > > there. > > 5ce6c1f3535f ("riscv/atomic: Strengthen implementations with fences") > > is about fixing a wrong spinlock/unlock implementation and is not related > > to this patch. > > No. The commit in question is evidence of the fact that the changes > you are presenting here (as an optimization) were buggy/incorrect at > the time in which that commit was worked out. I think I mixed it up with 0123f4d76ca6 ("riscv/spinlock: Strengthen implementations with fences"). 
> > > > Actually, sc.w.aqrl is very strong and the same with: > > fence rw, rw > > sc.w > > fence rw,rw > > > > So "which do not give full-ordering with .aqrl" is not writen in > > RISC-V ISA and we could use sc.w/d.aqrl with LKMM. > > > > > > > > >> describes the issue more specifically, that's when we added these > > > >> fences. There have certainly been complains that these fences are too > > > >> heavyweight for the HW to go fast, but IIUC it's the best option we have > > > > Yeah, it would reduce the performance on D1 and our next-generation > > > > processor has optimized fence performance a lot. > > > > > > Definately a bummer that the fences make the HW go slow, but I don't > > > really see any other way to go about this. If you think these mappings > > > are valid for LKMM and RVWMO then we should figure this out, but trying > > > to drop fences to make HW go faster in ways that violate the memory > > > model is going to lead to insanity. > > Actually, this patch is okay with the ISA spec, and Dan also thought > > it was valid. > > > > ref: https://lore.kernel.org/lkml/41e01514-74ca-84f2-f5cc-2645c444fd8e@nvidia.com/raw > > "Thoughts" on this regard have _changed_. Please compare that quote > with, e.g. > > https://lore.kernel.org/linux-riscv/ddd5ca34-805b-60c4-bf2a-d6a9d95d89e7@nvidia.com/ > > So here's a suggestion: > > Reviewers of your patches have asked: How come that code we used to > consider as buggy is now considered "an optimization" (correct)? > > Denying the evidence or going around it is not making their job (and > this upstreaming) easier, so why don't you address it? Take time to > review previous works and discussions in this area, understand them, > and integrate such knowledge in future submissions. > > Andrea > > > > ------ > > > 2. Using .aqrl to replace the fence rw, rw is okay to ISA manual, > > > right? 
And reducing a fence instruction to gain better performance: > > > "0: lr.w %[p], %[c]\n" > > > " sub %[rc], %[p], %[o]\n" > > > " bltz %[rc], 1f\n". > > > - " sc.w.rl %[rc], %[rc], %[c]\n" > > > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > > " bnez %[rc], 0b\n" > > > - " fence rw, rw\n" > > > > Yes, using .aqrl is valid. > > > > Dan > > ------ > > > > > > > > > > I can definately buy the argument that we have mappings of various > > > application memory models that are difficult to make fast given the ISA > > > encodings of RVWMO we have, but that's not really an issue we can fix in > > > the atomic mappings. > > > > > > >> given the current set of memory model primitives we implement in the > > > >> ISA (ie, there's more in RVWMO but no way to encode that). > > > >> > > > >> The others all look good, though, and as these are really all > > > >> independent cleanups I'm going to go ahead and put those three on > > > >> for-next. > > > >> > > > >> There's also a bunch of checkpatch errors. The ones about "*" seem > > > >> spurious, but the alignment ones aren't. Here's my fixups: > > > > Thx for the fixup. > > > > > > > >> > > > >> diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > > > >> index 34f757dfc8f2..0bde499fa8bc 100644 > > > >> --- a/arch/riscv/include/asm/atomic.h > > > >> +++ b/arch/riscv/include/asm/atomic.h > > > >> @@ -86,9 +86,9 @@ ATOMIC_OPS(xor, xor, i) > > > >> * versions, while the logical ops only have fetch versions. 
> > > >> */ > > > >> #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix) \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> register c_type ret; \ > > > >> __asm__ __volatile__ ( \ > > > >> @@ -98,9 +98,9 @@ c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i, \ > > > >> : "memory"); \ > > > >> return ret; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> register c_type ret; \ > > > >> __asm__ __volatile__ ( \ > > > >> @@ -110,9 +110,9 @@ c_type arch_atomic##prefix##_fetch_##op##_acquire(c_type i, \ > > > >> : "memory"); \ > > > >> return ret; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> register c_type ret; \ > > > >> __asm__ __volatile__ ( \ > > > >> @@ -122,8 +122,8 @@ c_type arch_atomic##prefix##_fetch_##op##_release(c_type i, \ > > > >> : "memory"); \ > > > >> return ret; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > > >> { \ > > > >> register c_type ret; \ > > > >> __asm__ __volatile__ ( \ > > > >> @@ -135,28 +135,28 @@ c_type 
arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v) \ > > > >> } > > > >> > > > >> #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix) \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_##op##_return_relaxed(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> - return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > > > >> + return arch_atomic##prefix##_fetch_##op##_relaxed(i, v) c_op I; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_##op##_return_acquire(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> - return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > > > >> + return arch_atomic##prefix##_fetch_##op##_acquire(i, v) c_op I; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_##op##_return_release(c_type i, \ > > > >> - atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_##op##_return_release(c_type i, \ > > > >> + atomic##prefix##_t *v) \ > > > >> { \ > > > >> - return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > > > >> + return arch_atomic##prefix##_fetch_##op##_release(i, v) c_op I; \ > > > >> } \ > > > >> -static __always_inline \ > > > >> -c_type arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > > > >> +static __always_inline c_type \ > > > >> +arch_atomic##prefix##_##op##_return(c_type i, atomic##prefix##_t *v) \ > > > >> { \ > > > >> - return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > > > >> + return arch_atomic##prefix##_fetch_##op(i, v) c_op I; \ > > > >> } > > > >> > > > >> 
#ifdef CONFIG_GENERIC_ATOMIC64 > > > >> > > > >> > > > >> > > > > >> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > > > >> > Signed-off-by: Guo Ren <guoren@kernel.org> > > > >> > Cc: Palmer Dabbelt <palmer@dabbelt.com> > > > >> > Cc: Mark Rutland <mark.rutland@arm.com> > > > >> > Cc: Dan Lustig <dlustig@nvidia.com> > > > >> > Cc: Andrea Parri <parri.andrea@gmail.com> > > > >> > --- > > > >> > arch/riscv/include/asm/atomic.h | 24 ++++++++---------------- > > > >> > arch/riscv/include/asm/cmpxchg.h | 6 ++---- > > > >> > 2 files changed, 10 insertions(+), 20 deletions(-) > > > >> > > > > >> > diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h > > > >> > index 34f757dfc8f2..aef8aa9ac4f4 100644 > > > >> > --- a/arch/riscv/include/asm/atomic.h > > > >> > +++ b/arch/riscv/include/asm/atomic.h > > > >> > @@ -269,9 +269,8 @@ static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int > > > >> > "0: lr.w %[p], %[c]\n" > > > >> > " beq %[p], %[u], 1f\n" > > > >> > " add %[rc], %[p], %[a]\n" > > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : [a]"r" (a), [u]"r" (u) > > > >> > @@ -290,9 +289,8 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, > > > >> > "0: lr.d %[p], %[c]\n" > > > >> > " beq %[p], %[u], 1f\n" > > > >> > " add %[rc], %[p], %[a]\n" > > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : [a]"r" (a), [u]"r" (u) > > > >> > @@ -382,9 +380,8 @@ static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v) > > > >> > "0: lr.w %[p], %[c]\n" > > > >> > " bltz %[p], 1f\n" > > > >> > " 
addi %[rc], %[p], 1\n" > > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > @@ -402,9 +399,8 @@ static __always_inline bool arch_atomic_dec_unless_positive(atomic_t *v) > > > >> > "0: lr.w %[p], %[c]\n" > > > >> > " bgtz %[p], 1f\n" > > > >> > " addi %[rc], %[p], -1\n" > > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > @@ -422,9 +418,8 @@ static __always_inline int arch_atomic_dec_if_positive(atomic_t *v) > > > >> > "0: lr.w %[p], %[c]\n" > > > >> > " addi %[rc], %[p], -1\n" > > > >> > " bltz %[rc], 1f\n" > > > >> > - " sc.w.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.w.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > @@ -444,9 +439,8 @@ static __always_inline bool arch_atomic64_inc_unless_negative(atomic64_t *v) > > > >> > "0: lr.d %[p], %[c]\n" > > > >> > " bltz %[p], 1f\n" > > > >> > " addi %[rc], %[p], 1\n" > > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > @@ -465,9 +459,8 @@ static __always_inline bool arch_atomic64_dec_unless_positive(atomic64_t *v) > > > >> > "0: lr.d %[p], %[c]\n" > > > >> > " bgtz %[p], 1f\n" > > > >> > " addi %[rc], %[p], -1\n" > > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence 
rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > @@ -486,9 +479,8 @@ static __always_inline s64 arch_atomic64_dec_if_positive(atomic64_t *v) > > > >> > "0: lr.d %[p], %[c]\n" > > > >> > " addi %[rc], %[p], -1\n" > > > >> > " bltz %[rc], 1f\n" > > > >> > - " sc.d.rl %[rc], %[rc], %[c]\n" > > > >> > + " sc.d.aqrl %[rc], %[rc], %[c]\n" > > > >> > " bnez %[rc], 0b\n" > > > >> > - " fence rw, rw\n" > > > >> > "1:\n" > > > >> > : [p]"=&r" (prev), [rc]"=&r" (rc), [c]"+A" (v->counter) > > > >> > : > > > >> > diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h > > > >> > index 1af8db92250b..9269fceb86e0 100644 > > > >> > --- a/arch/riscv/include/asm/cmpxchg.h > > > >> > +++ b/arch/riscv/include/asm/cmpxchg.h > > > >> > @@ -307,9 +307,8 @@ > > > >> > __asm__ __volatile__ ( \ > > > >> > "0: lr.w %0, %2\n" \ > > > >> > " bne %0, %z3, 1f\n" \ > > > >> > - " sc.w.rl %1, %z4, %2\n" \ > > > >> > + " sc.w.aqrl %1, %z4, %2\n" \ > > > >> > " bnez %1, 0b\n" \ > > > >> > - " fence rw, rw\n" \ > > > >> > "1:\n" \ > > > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > > >> > : "rJ" ((long)__old), "rJ" (__new) \ > > > >> > @@ -319,9 +318,8 @@ > > > >> > __asm__ __volatile__ ( \ > > > >> > "0: lr.d %0, %2\n" \ > > > >> > " bne %0, %z3, 1f\n" \ > > > >> > - " sc.d.rl %1, %z4, %2\n" \ > > > >> > + " sc.d.aqrl %1, %z4, %2\n" \ > > > >> > " bnez %1, 0b\n" \ > > > >> > - " fence rw, rw\n" \ > > > >> > "1:\n" \ > > > >> > : "=&r" (__ret), "=&r" (__rc), "+A" (*__ptr) \ > > > >> > : "rJ" (__old), "rJ" (__new) \ > > > > > > > > -- > > Best Regards > > Guo Ren > > > > ML: https://lore.kernel.org/linux-csky/ -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/ ^ permalink raw reply [flat|nested] 24+ messages in thread
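The dec_if_positive loops touched by the patch above all share one shape: load the counter, compute the decremented value, bail out if the result would be negative, otherwise store conditionally and retry on SC failure. A hedged portable-C11 sketch of that shape (editor's naming, not the kernel's inline assembly; seq_cst models the sc.aqrl ordering the patch switches to):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of atomic_dec_if_positive() semantics (illustrative): decrement
 * @v only if the result would be non-negative; always return the
 * decremented value, whether or not it was stored. */
static int dec_if_positive(atomic_int *v)
{
    int prev = atomic_load_explicit(v, memory_order_relaxed);
    int rc;

    do {
        rc = prev - 1;      /* "addi %[rc], %[p], -1" */
        if (rc < 0)         /* "bltz %[rc], 1f" */
            break;          /* skip the store entirely */
    } while (!atomic_compare_exchange_weak_explicit(v, &prev, rc,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));
    return rc;
}

static int demo_dec(void)
{
    atomic_int v = 1;
    int r1 = dec_if_positive(&v);  /* 1 -> 0, returns 0 */
    int r2 = dec_if_positive(&v);  /* would go negative: returns -1, v stays 0 */

    return r1 * 100 + r2 * 10 + atomic_load(&v);
}
```

Note that the early-exit path performs no store, which is why the ordering debate in this thread centers on what the successful SC alone must guarantee.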
end of thread, other threads:[~2022-08-09 7:06 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-05 3:55 [PATCH V4 0/5] riscv: Optimize atomic implementation guoren 2022-05-05 3:55 ` [PATCH V4 1/5] riscv: atomic: Cleanup unnecessary definition guoren 2022-05-05 3:55 ` [PATCH V4 2/5] riscv: atomic: Optimize dec_if_positive functions guoren 2022-05-05 3:55 ` [PATCH V4 3/5] riscv: atomic: Add custom conditional atomic operation implementation guoren 2022-05-05 3:55 ` [PATCH V4 4/5] riscv: atomic: Optimize atomic_ops & xchg with .aq/rl annotation guoren 2022-05-05 3:55 ` [PATCH V4 5/5] riscv: atomic: Optimize LRSC-pairs atomic ops with .aqrl annotation guoren 2022-05-21 20:46 ` Palmer Dabbelt 2022-05-22 13:12 ` Guo Ren 2022-06-02 5:59 ` Palmer Dabbelt 2022-06-13 11:49 ` Guo Ren 2022-06-14 11:03 ` Andrea Parri 2022-06-23 3:31 ` Boqun Feng 2022-06-23 17:09 ` Dan Lustig 2022-06-23 17:55 ` Boqun Feng 2022-06-23 22:15 ` Palmer Dabbelt 2022-06-24 3:34 ` Guo Ren 2022-06-25 5:29 ` Guo Ren 2022-07-07 0:03 ` Boqun Feng 2022-07-13 13:38 ` Dan Lustig 2022-07-13 23:34 ` Guo Ren 2022-07-13 23:47 ` Guo Ren 2022-07-14 13:06 ` Dan Lustig 2022-08-09 7:06 ` Guo Ren 2022-06-24 3:28 ` Guo Ren