* [PATCH 0/5] arm64: atomics: cleanups and codegen improvements
@ 2021-12-10 15:14 Mark Rutland
  2021-12-10 15:14 ` [PATCH 1/5] arm64: atomics: format whitespace consistently Mark Rutland
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

While looking at Peter's recent refcount rework, I spotted that we have
some unfortunate code generation for the LSE atomics. Due to a
combination of assembly constraints and manipulation performed in
assembly which the compiler has no visibility of, the compiler ends up
generating unnecessary register shuffling and redundant manipulation.

This series (based on v5.16-rc4) addresses this by using more accurate
asm constraints and by moving value manipulation out of assembly and
into C, where the compiler can perform a number of optimizations. This
also has the benefit of simplifying the implementation and deleting
100+ lines of code.

This is purely a cleanup and optimization; there should be no functional
change as a result of the series.
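
To make the shape of the change concrete, here is the atomic64 SUB op
before and after (excerpted from patch 2 of this series; the
surrounding macros are elided, so treat it as a sketch):

| /* Before: the negation is hidden from the compiler inside the asm. */
| static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
| {
| 	asm volatile(
| 	__LSE_PREAMBLE
| 	"	neg	%[i], %[i]\n"
| 	"	stadd	%[i], %[v]"
| 	: [i] "+&r" (i), [v] "+Q" (v->counter)
| 	: "r" (v));
| }
| 
| /* After: the negation is plain C, and the asm is shared with the ADD op. */
| static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
| {
| 	__lse_atomic64_add(-i, v);
| }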

I've pushed the series out to my arm64/atomics/improvements branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/atomics/improvements
  git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/atomics/improvements

For comparison, using GCC 11.1.0 to compile the following code:

| s64 example64_fetch_and(s64 i, atomic64_t *v)
| {
| 	return __lse_atomic64_fetch_and(i, v);
| }
| 
| s64 example64_fetch_and_f(atomic64_t *v)
| {
| 	return __lse_atomic64_fetch_and(0xf, v);
| }
| 
| s64 example64_fetch_sub(s64 i, atomic64_t *v)
| {
| 	return __lse_atomic64_fetch_sub(i, v);
| }
| 
| s64 example64_fetch_sub_f(atomic64_t *v)
| {
| 	return __lse_atomic64_fetch_sub(0xf, v);
| }
| 
| s64 example64_sub_return(s64 i, atomic64_t *v)
| {
| 	return __lse_atomic64_sub_return(i, v);
| }
| 
| s64 example64_sub_return_f(atomic64_t *v)
| {
| 	return __lse_atomic64_sub_return(0xf, v);
| }

Before this series:

| 0000000000000000 <example64_fetch_and>:
|    0:   aa2003e0        mvn     x0, x0
|    4:   f8e01020        ldclral x0, x0, [x1]
|    8:   d65f03c0        ret
|    c:   d503201f        nop
| 
| 0000000000000010 <example64_fetch_and_f>:
|   10:   aa0003e2        mov     x2, x0
|   14:   d28001e1        mov     x1, #0xf                        // #15
|   18:   aa0103e0        mov     x0, x1
|   1c:   aa2003e0        mvn     x0, x0
|   20:   f8e01040        ldclral x0, x0, [x2]
|   24:   d65f03c0        ret
|   28:   d503201f        nop
|   2c:   d503201f        nop
| 
| 0000000000000030 <example64_fetch_sub>:
|   30:   cb0003e0        neg     x0, x0
|   34:   f8e00020        ldaddal x0, x0, [x1]
|   38:   d65f03c0        ret
|   3c:   d503201f        nop
| 
| 0000000000000040 <example64_fetch_sub_f>:
|   40:   aa0003e2        mov     x2, x0
|   44:   d28001e1        mov     x1, #0xf                        // #15
|   48:   aa0103e0        mov     x0, x1
|   4c:   cb0003e0        neg     x0, x0
|   50:   f8e00040        ldaddal x0, x0, [x2]
|   54:   d65f03c0        ret
|   58:   d503201f        nop
|   5c:   d503201f        nop
| 
| 0000000000000060 <example64_sub_return>:
|   60:   cb0003e0        neg     x0, x0
|   64:   f8e00022        ldaddal x0, x2, [x1]
|   68:   8b020000        add     x0, x0, x2
|   6c:   d65f03c0        ret
| 
| 0000000000000070 <example64_sub_return_f>:
|   70:   aa0003e2        mov     x2, x0
|   74:   d28001e1        mov     x1, #0xf                        // #15
|   78:   aa0103e0        mov     x0, x1
|   7c:   cb0003e0        neg     x0, x0
|   80:   f8e00041        ldaddal x0, x1, [x2]
|   84:   8b010000        add     x0, x0, x1
|   88:   d65f03c0        ret
|   8c:   d503201f        nop

After this series:

| 0000000000000000 <example64_fetch_and>:
|    0:   aa2003e0        mvn     x0, x0
|    4:   f8e01020        ldclral x0, x0, [x1]
|    8:   d65f03c0        ret
|    c:   d503201f        nop
| 
| 0000000000000010 <example64_fetch_and_f>:
|   10:   928001e1        mov     x1, #0xfffffffffffffff0         // #-16
|   14:   f8e11001        ldclral x1, x1, [x0]
|   18:   aa0103e0        mov     x0, x1
|   1c:   d65f03c0        ret
| 
| 0000000000000020 <example64_fetch_sub>:
|   20:   cb0003e0        neg     x0, x0
|   24:   f8e00020        ldaddal x0, x0, [x1]
|   28:   d65f03c0        ret
|   2c:   d503201f        nop
| 
| 0000000000000030 <example64_fetch_sub_f>:
|   30:   928001c1        mov     x1, #0xfffffffffffffff1         // #-15
|   34:   f8e10001        ldaddal x1, x1, [x0]
|   38:   aa0103e0        mov     x0, x1
|   3c:   d65f03c0        ret
| 
| 0000000000000040 <example64_sub_return>:
|   40:   cb0003e2        neg     x2, x0
|   44:   f8e20022        ldaddal x2, x2, [x1]
|   48:   cb000040        sub     x0, x2, x0
|   4c:   d65f03c0        ret
| 
| 0000000000000050 <example64_sub_return_f>:
|   50:   928001c1        mov     x1, #0xfffffffffffffff1         // #-15
|   54:   f8e10001        ldaddal x1, x1, [x0]
|   58:   d1003c20        sub     x0, x1, #0xf
|   5c:   d65f03c0        ret

Thanks,
Mark.

Mark Rutland (5):
  arm64: atomics: format whitespace consistently
  arm64: atomics lse: define SUBs in terms of ADDs
  arm64: atomics: lse: define ANDs in terms of ANDNOTs
  arm64: atomics: lse: improve constraints for simple ops
  arm64: atomics: lse: define RETURN ops in terms of FETCH ops

 arch/arm64/include/asm/atomic_ll_sc.h |  86 ++++----
 arch/arm64/include/asm/atomic_lse.h   | 270 ++++++++------------------
 2 files changed, 126 insertions(+), 230 deletions(-)

-- 
2.30.2



* [PATCH 1/5] arm64: atomics: format whitespace consistently
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
@ 2021-12-10 15:14 ` Mark Rutland
  2021-12-13 19:20   ` Will Deacon
  2021-12-10 15:14 ` [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs Mark Rutland
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

The code for the atomic ops is formatted inconsistently, and while this
is not a functional problem it is rather distracting when working on
them.

Some ops have consistent indentation, e.g.

| #define ATOMIC_OP_ADD_RETURN(name, mb, cl...)                           \
| static inline int __lse_atomic_add_return##name(int i, atomic_t *v)     \
| {                                                                       \
|         u32 tmp;                                                        \
|                                                                         \
|         asm volatile(                                                   \
|         __LSE_PREAMBLE                                                  \
|         "       ldadd" #mb "    %w[i], %w[tmp], %[v]\n"                 \
|         "       add     %w[i], %w[i], %w[tmp]"                          \
|         : [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)        \
|         : "r" (v)                                                       \
|         : cl);                                                          \
|                                                                         \
|         return i;                                                       \
| }

While others have negative indentation for some lines, and/or have
misaligned trailing backslashes, e.g.

| static inline void __lse_atomic_##op(int i, atomic_t *v)                        \
| {                                                                       \
|         asm volatile(                                                   \
|         __LSE_PREAMBLE                                                  \
| "       " #asm_op "     %w[i], %[v]\n"                                  \
|         : [i] "+r" (i), [v] "+Q" (v->counter)                           \
|         : "r" (v));                                                     \
| }

This patch makes the indentation consistent and also aligns the trailing
backslashes. This makes the code easier to read for those (like myself)
who are easily distracted by these inconsistencies.

This is intended as a cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_ll_sc.h | 86 +++++++++++++--------------
 arch/arm64/include/asm/atomic_lse.h   | 14 ++---
 2 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 13869b76b58c..fe0db8d416fb 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -44,11 +44,11 @@ __ll_sc_atomic_##op(int i, atomic_t *v)					\
 									\
 	asm volatile("// atomic_" #op "\n"				\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %2\n"					\
-"1:	ldxr	%w0, %2\n"						\
-"	" #asm_op "	%w0, %w0, %w3\n"				\
-"	stxr	%w1, %w0, %2\n"						\
-"	cbnz	%w1, 1b\n")						\
+	"	prfm	pstl1strm, %2\n"				\
+	"1:	ldxr	%w0, %2\n"					\
+	"	" #asm_op "	%w0, %w0, %w3\n"			\
+	"	stxr	%w1, %w0, %2\n"					\
+	"	cbnz	%w1, 1b\n")					\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: __stringify(constraint) "r" (i));				\
 }
@@ -62,12 +62,12 @@ __ll_sc_atomic_##op##_return##name(int i, atomic_t *v)			\
 									\
 	asm volatile("// atomic_" #op "_return" #name "\n"		\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %2\n"					\
-"1:	ld" #acq "xr	%w0, %2\n"					\
-"	" #asm_op "	%w0, %w0, %w3\n"				\
-"	st" #rel "xr	%w1, %w0, %2\n"					\
-"	cbnz	%w1, 1b\n"						\
-"	" #mb )								\
+	"	prfm	pstl1strm, %2\n"				\
+	"1:	ld" #acq "xr	%w0, %2\n"				\
+	"	" #asm_op "	%w0, %w0, %w3\n"			\
+	"	st" #rel "xr	%w1, %w0, %2\n"				\
+	"	cbnz	%w1, 1b\n"					\
+	"	" #mb )							\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: __stringify(constraint) "r" (i)				\
 	: cl);								\
@@ -84,12 +84,12 @@ __ll_sc_atomic_fetch_##op##name(int i, atomic_t *v)			\
 									\
 	asm volatile("// atomic_fetch_" #op #name "\n"			\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %3\n"					\
-"1:	ld" #acq "xr	%w0, %3\n"					\
-"	" #asm_op "	%w1, %w0, %w4\n"				\
-"	st" #rel "xr	%w2, %w1, %3\n"					\
-"	cbnz	%w2, 1b\n"						\
-"	" #mb )								\
+	"	prfm	pstl1strm, %3\n"				\
+	"1:	ld" #acq "xr	%w0, %3\n"				\
+	"	" #asm_op "	%w1, %w0, %w4\n"			\
+	"	st" #rel "xr	%w2, %w1, %3\n"				\
+	"	cbnz	%w2, 1b\n"					\
+	"	" #mb )							\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
 	: __stringify(constraint) "r" (i)				\
 	: cl);								\
@@ -143,11 +143,11 @@ __ll_sc_atomic64_##op(s64 i, atomic64_t *v)				\
 									\
 	asm volatile("// atomic64_" #op "\n"				\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %2\n"					\
-"1:	ldxr	%0, %2\n"						\
-"	" #asm_op "	%0, %0, %3\n"					\
-"	stxr	%w1, %0, %2\n"						\
-"	cbnz	%w1, 1b")						\
+	"	prfm	pstl1strm, %2\n"				\
+	"1:	ldxr	%0, %2\n"					\
+	"	" #asm_op "	%0, %0, %3\n"				\
+	"	stxr	%w1, %0, %2\n"					\
+	"	cbnz	%w1, 1b")					\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: __stringify(constraint) "r" (i));				\
 }
@@ -161,12 +161,12 @@ __ll_sc_atomic64_##op##_return##name(s64 i, atomic64_t *v)		\
 									\
 	asm volatile("// atomic64_" #op "_return" #name "\n"		\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %2\n"					\
-"1:	ld" #acq "xr	%0, %2\n"					\
-"	" #asm_op "	%0, %0, %3\n"					\
-"	st" #rel "xr	%w1, %0, %2\n"					\
-"	cbnz	%w1, 1b\n"						\
-"	" #mb )								\
+	"	prfm	pstl1strm, %2\n"				\
+	"1:	ld" #acq "xr	%0, %2\n"				\
+	"	" #asm_op "	%0, %0, %3\n"				\
+	"	st" #rel "xr	%w1, %0, %2\n"				\
+	"	cbnz	%w1, 1b\n"					\
+	"	" #mb )							\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: __stringify(constraint) "r" (i)				\
 	: cl);								\
@@ -176,19 +176,19 @@ __ll_sc_atomic64_##op##_return##name(s64 i, atomic64_t *v)		\
 
 #define ATOMIC64_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)\
 static inline long							\
-__ll_sc_atomic64_fetch_##op##name(s64 i, atomic64_t *v)		\
+__ll_sc_atomic64_fetch_##op##name(s64 i, atomic64_t *v)			\
 {									\
 	s64 result, val;						\
 	unsigned long tmp;						\
 									\
 	asm volatile("// atomic64_fetch_" #op #name "\n"		\
 	__LL_SC_FALLBACK(						\
-"	prfm	pstl1strm, %3\n"					\
-"1:	ld" #acq "xr	%0, %3\n"					\
-"	" #asm_op "	%1, %0, %4\n"					\
-"	st" #rel "xr	%w2, %1, %3\n"					\
-"	cbnz	%w2, 1b\n"						\
-"	" #mb )								\
+	"	prfm	pstl1strm, %3\n"				\
+	"1:	ld" #acq "xr	%0, %3\n"				\
+	"	" #asm_op "	%1, %0, %4\n"				\
+	"	st" #rel "xr	%w2, %1, %3\n"				\
+	"	cbnz	%w2, 1b\n"					\
+	"	" #mb )							\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
 	: __stringify(constraint) "r" (i)				\
 	: cl);								\
@@ -241,14 +241,14 @@ __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
 
 	asm volatile("// atomic64_dec_if_positive\n"
 	__LL_SC_FALLBACK(
-"	prfm	pstl1strm, %2\n"
-"1:	ldxr	%0, %2\n"
-"	subs	%0, %0, #1\n"
-"	b.lt	2f\n"
-"	stlxr	%w1, %0, %2\n"
-"	cbnz	%w1, 1b\n"
-"	dmb	ish\n"
-"2:")
+	"	prfm	pstl1strm, %2\n"
+	"1:	ldxr	%0, %2\n"
+	"	subs	%0, %0, #1\n"
+	"	b.lt	2f\n"
+	"	stlxr	%w1, %0, %2\n"
+	"	cbnz	%w1, 1b\n"
+	"	dmb	ish\n"
+	"2:")
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
 	:
 	: "cc", "memory");
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index da3280f639cd..ab661375835e 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -11,11 +11,11 @@
 #define __ASM_ATOMIC_LSE_H
 
 #define ATOMIC_OP(op, asm_op)						\
-static inline void __lse_atomic_##op(int i, atomic_t *v)			\
+static inline void __lse_atomic_##op(int i, atomic_t *v)		\
 {									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-"	" #asm_op "	%w[i], %[v]\n"					\
+	"	" #asm_op "	%w[i], %[v]\n"				\
 	: [i] "+r" (i), [v] "+Q" (v->counter)				\
 	: "r" (v));							\
 }
@@ -32,7 +32,7 @@ static inline int __lse_atomic_fetch_##op##name(int i, atomic_t *v)	\
 {									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-"	" #asm_op #mb "	%w[i], %w[i], %[v]"				\
+	"	" #asm_op #mb "	%w[i], %w[i], %[v]"			\
 	: [i] "+r" (i), [v] "+Q" (v->counter)				\
 	: "r" (v)							\
 	: cl);								\
@@ -130,7 +130,7 @@ static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
 	"	add	%w[i], %w[i], %w[tmp]"				\
 	: [i] "+&r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
 	: "r" (v)							\
-	: cl);							\
+	: cl);								\
 									\
 	return i;							\
 }
@@ -168,7 +168,7 @@ static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)		\
 {									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-"	" #asm_op "	%[i], %[v]\n"					\
+	"	" #asm_op "	%[i], %[v]\n"				\
 	: [i] "+r" (i), [v] "+Q" (v->counter)				\
 	: "r" (v));							\
 }
@@ -185,7 +185,7 @@ static inline long __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v)\
 {									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-"	" #asm_op #mb "	%[i], %[i], %[v]"				\
+	"	" #asm_op #mb "	%[i], %[i], %[v]"			\
 	: [i] "+r" (i), [v] "+Q" (v->counter)				\
 	: "r" (v)							\
 	: cl);								\
@@ -272,7 +272,7 @@ static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 }
 
 #define ATOMIC64_OP_SUB_RETURN(name, mb, cl...)				\
-static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)\
 {									\
 	unsigned long tmp;						\
 									\
-- 
2.30.2



* [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
  2021-12-10 15:14 ` [PATCH 1/5] arm64: atomics: format whitespace consistently Mark Rutland
@ 2021-12-10 15:14 ` Mark Rutland
  2021-12-13 19:27   ` Will Deacon
  2021-12-10 15:14 ` [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs Mark Rutland
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

The FEAT_LSE atomic instructions include atomic ADD instructions
(`stadd*` and `ldadd*`), but do not include atomic SUB instructions, so
we must build all of the SUB operations using the ADD instructions. We
open-code these today, with each SUB op implemented as a copy of the
corresponding ADD op with a leading `neg` instruction in the inline
assembly to negate the `i` argument.

As the compiler has no visibility of the `neg`, this leads to less than
optimal code generation when generating `i` into a register. For
example, __lse_atomic_fetch_sub(1, v) can be compiled to:

	mov     w1, #0x1
	neg     w1, w1
	ldaddal w1, w1, [x2]

This patch improves this by replacing the `neg` with negation in C
before the inline assembly block, e.g.

	i = -i;

This allows the compiler to generate `i` into a register more optimally,
e.g.

	mov     w1, #0xffffffff
	ldaddal w1, w1, [x2]

With this change the assembly for each SUB op is identical to the
corresponding ADD op (including barriers and clobbers), so I've removed
the inline assembly and rewritten each SUB op in terms of the
corresponding ADD op, e.g.

| static inline void __lse_atomic_sub(int i, atomic_t *v)
| {
| 	__lse_atomic_add(-i, v);
| }

For clarity I've moved the definition of each SUB op immediately after
the corresponding ADD op, and used a single macro to create the RETURN
forms of both ops.
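
The FETCH and RETURN forms reduce the same way; as a rough sketch, the
fully-ordered variants generated by the macros below boil down to:

| static inline int __lse_atomic_fetch_sub(int i, atomic_t *v)
| {
| 	return __lse_atomic_fetch_add(-i, v);
| }
| 
| static inline int __lse_atomic_sub_return(int i, atomic_t *v)
| {
| 	return __lse_atomic_add_return(-i, v);
| }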

This is intended as an optimization and cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 180 +++++++++-------------------
 1 file changed, 58 insertions(+), 122 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index ab661375835e..7454febb6d77 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -25,6 +25,11 @@ ATOMIC_OP(or, stset)
 ATOMIC_OP(xor, steor)
 ATOMIC_OP(add, stadd)
 
+static inline void __lse_atomic_sub(int i, atomic_t *v)
+{
+	__lse_atomic_add(-i, v);
+}
+
 #undef ATOMIC_OP
 
 #define ATOMIC_FETCH_OP(name, mb, op, asm_op, cl...)			\
@@ -54,7 +59,20 @@ ATOMIC_FETCH_OPS(add, ldadd)
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_FETCH_OPS
 
-#define ATOMIC_OP_ADD_RETURN(name, mb, cl...)				\
+#define ATOMIC_FETCH_OP_SUB(name)					\
+static inline int __lse_atomic_fetch_sub##name(int i, atomic_t *v)	\
+{									\
+	return __lse_atomic_fetch_add##name(-i, v);			\
+}
+
+ATOMIC_FETCH_OP_SUB(_relaxed)
+ATOMIC_FETCH_OP_SUB(_acquire)
+ATOMIC_FETCH_OP_SUB(_release)
+ATOMIC_FETCH_OP_SUB(        )
+
+#undef ATOMIC_FETCH_OP_SUB
+
+#define ATOMIC_OP_ADD_SUB_RETURN(name, mb, cl...)			\
 static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
 {									\
 	u32 tmp;							\
@@ -68,14 +86,19 @@ static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
 	: cl);								\
 									\
 	return i;							\
+}									\
+									\
+static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
+{									\
+	return __lse_atomic_add_return##name(-i, v);			\
 }
 
-ATOMIC_OP_ADD_RETURN(_relaxed,   )
-ATOMIC_OP_ADD_RETURN(_acquire,  a, "memory")
-ATOMIC_OP_ADD_RETURN(_release,  l, "memory")
-ATOMIC_OP_ADD_RETURN(        , al, "memory")
+ATOMIC_OP_ADD_SUB_RETURN(_relaxed,   )
+ATOMIC_OP_ADD_SUB_RETURN(_acquire,  a, "memory")
+ATOMIC_OP_ADD_SUB_RETURN(_release,  l, "memory")
+ATOMIC_OP_ADD_SUB_RETURN(        , al, "memory")
 
-#undef ATOMIC_OP_ADD_RETURN
+#undef ATOMIC_OP_ADD_SUB_RETURN
 
 static inline void __lse_atomic_and(int i, atomic_t *v)
 {
@@ -108,61 +131,6 @@ ATOMIC_FETCH_OP_AND(        , al, "memory")
 
 #undef ATOMIC_FETCH_OP_AND
 
-static inline void __lse_atomic_sub(int i, atomic_t *v)
-{
-	asm volatile(
-	__LSE_PREAMBLE
-	"	neg	%w[i], %w[i]\n"
-	"	stadd	%w[i], %[v]"
-	: [i] "+&r" (i), [v] "+Q" (v->counter)
-	: "r" (v));
-}
-
-#define ATOMIC_OP_SUB_RETURN(name, mb, cl...)				\
-static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
-{									\
-	u32 tmp;							\
-									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	neg	%w[i], %w[i]\n"					\
-	"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"			\
-	"	add	%w[i], %w[i], %w[tmp]"				\
-	: [i] "+&r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
-}
-
-ATOMIC_OP_SUB_RETURN(_relaxed,   )
-ATOMIC_OP_SUB_RETURN(_acquire,  a, "memory")
-ATOMIC_OP_SUB_RETURN(_release,  l, "memory")
-ATOMIC_OP_SUB_RETURN(        , al, "memory")
-
-#undef ATOMIC_OP_SUB_RETURN
-
-#define ATOMIC_FETCH_OP_SUB(name, mb, cl...)				\
-static inline int __lse_atomic_fetch_sub##name(int i, atomic_t *v)	\
-{									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	neg	%w[i], %w[i]\n"					\
-	"	ldadd" #mb "	%w[i], %w[i], %[v]"			\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
-}
-
-ATOMIC_FETCH_OP_SUB(_relaxed,   )
-ATOMIC_FETCH_OP_SUB(_acquire,  a, "memory")
-ATOMIC_FETCH_OP_SUB(_release,  l, "memory")
-ATOMIC_FETCH_OP_SUB(        , al, "memory")
-
-#undef ATOMIC_FETCH_OP_SUB
-
 #define ATOMIC64_OP(op, asm_op)						\
 static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)		\
 {									\
@@ -178,6 +146,11 @@ ATOMIC64_OP(or, stset)
 ATOMIC64_OP(xor, steor)
 ATOMIC64_OP(add, stadd)
 
+static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
+{
+	__lse_atomic64_add(-i, v);
+}
+
 #undef ATOMIC64_OP
 
 #define ATOMIC64_FETCH_OP(name, mb, op, asm_op, cl...)			\
@@ -207,7 +180,20 @@ ATOMIC64_FETCH_OPS(add, ldadd)
 #undef ATOMIC64_FETCH_OP
 #undef ATOMIC64_FETCH_OPS
 
-#define ATOMIC64_OP_ADD_RETURN(name, mb, cl...)				\
+#define ATOMIC64_FETCH_OP_SUB(name)					\
+static inline long __lse_atomic64_fetch_sub##name(s64 i, atomic64_t *v)	\
+{									\
+	return __lse_atomic64_fetch_add##name(-i, v);			\
+}
+
+ATOMIC64_FETCH_OP_SUB(_relaxed)
+ATOMIC64_FETCH_OP_SUB(_acquire)
+ATOMIC64_FETCH_OP_SUB(_release)
+ATOMIC64_FETCH_OP_SUB(        )
+
+#undef ATOMIC64_FETCH_OP_SUB
+
+#define ATOMIC64_OP_ADD_SUB_RETURN(name, mb, cl...)			\
 static inline long __lse_atomic64_add_return##name(s64 i, atomic64_t *v)\
 {									\
 	unsigned long tmp;						\
@@ -221,14 +207,19 @@ static inline long __lse_atomic64_add_return##name(s64 i, atomic64_t *v)\
 	: cl);								\
 									\
 	return i;							\
+}									\
+									\
+static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)\
+{									\
+	return __lse_atomic64_add_return##name(-i, v);			\
 }
 
-ATOMIC64_OP_ADD_RETURN(_relaxed,   )
-ATOMIC64_OP_ADD_RETURN(_acquire,  a, "memory")
-ATOMIC64_OP_ADD_RETURN(_release,  l, "memory")
-ATOMIC64_OP_ADD_RETURN(        , al, "memory")
+ATOMIC64_OP_ADD_SUB_RETURN(_relaxed,   )
+ATOMIC64_OP_ADD_SUB_RETURN(_acquire,  a, "memory")
+ATOMIC64_OP_ADD_SUB_RETURN(_release,  l, "memory")
+ATOMIC64_OP_ADD_SUB_RETURN(        , al, "memory")
 
-#undef ATOMIC64_OP_ADD_RETURN
+#undef ATOMIC64_OP_ADD_SUB_RETURN
 
 static inline void __lse_atomic64_and(s64 i, atomic64_t *v)
 {
@@ -261,61 +252,6 @@ ATOMIC64_FETCH_OP_AND(        , al, "memory")
 
 #undef ATOMIC64_FETCH_OP_AND
 
-static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
-{
-	asm volatile(
-	__LSE_PREAMBLE
-	"	neg	%[i], %[i]\n"
-	"	stadd	%[i], %[v]"
-	: [i] "+&r" (i), [v] "+Q" (v->counter)
-	: "r" (v));
-}
-
-#define ATOMIC64_OP_SUB_RETURN(name, mb, cl...)				\
-static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)\
-{									\
-	unsigned long tmp;						\
-									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	neg	%[i], %[i]\n"					\
-	"	ldadd" #mb "	%[i], %x[tmp], %[v]\n"			\
-	"	add	%[i], %[i], %x[tmp]"				\
-	: [i] "+&r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
-}
-
-ATOMIC64_OP_SUB_RETURN(_relaxed,   )
-ATOMIC64_OP_SUB_RETURN(_acquire,  a, "memory")
-ATOMIC64_OP_SUB_RETURN(_release,  l, "memory")
-ATOMIC64_OP_SUB_RETURN(        , al, "memory")
-
-#undef ATOMIC64_OP_SUB_RETURN
-
-#define ATOMIC64_FETCH_OP_SUB(name, mb, cl...)				\
-static inline long __lse_atomic64_fetch_sub##name(s64 i, atomic64_t *v)	\
-{									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	neg	%[i], %[i]\n"					\
-	"	ldadd" #mb "	%[i], %[i], %[v]"			\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
-}
-
-ATOMIC64_FETCH_OP_SUB(_relaxed,   )
-ATOMIC64_FETCH_OP_SUB(_acquire,  a, "memory")
-ATOMIC64_FETCH_OP_SUB(_release,  l, "memory")
-ATOMIC64_FETCH_OP_SUB(        , al, "memory")
-
-#undef ATOMIC64_FETCH_OP_SUB
-
 static inline s64 __lse_atomic64_dec_if_positive(atomic64_t *v)
 {
 	unsigned long tmp;
-- 
2.30.2



* [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
  2021-12-10 15:14 ` [PATCH 1/5] arm64: atomics: format whitespace consistently Mark Rutland
  2021-12-10 15:14 ` [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs Mark Rutland
@ 2021-12-10 15:14 ` Mark Rutland
  2021-12-13 19:29   ` Will Deacon
  2021-12-10 15:14 ` [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops Mark Rutland
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

The FEAT_LSE atomic instructions include atomic bit-clear instructions
(`ldclr*` and `stclr*`) which can be used to directly implement ANDNOT
operations. Each AND op is implemented as a copy of the corresponding
ANDNOT op with a leading `mvn` instruction to apply a bitwise NOT to the
`i` argument.

As the compiler has no visibility of the `mvn`, this leads to less than
optimal code generation when generating `i` into a register. For
example, __lse_atomic_fetch_and(0xf, v) can be compiled to:

	mov     w1, #0xf
	mvn     w1, w1
	ldclral w1, w1, [x2]

This patch improves this by replacing the `mvn` with NOT in C before the
inline assembly block, e.g.

	i = ~i;

This allows the compiler to generate `i` into a register more optimally,
e.g.

	mov     w1, #0xfffffff0
	ldclral w1, w1, [x2]

With this change the assembly for each AND op is identical to the
corresponding ANDNOT op (including barriers and clobbers), so I've
removed the inline assembly and rewritten each AND op in terms of the
corresponding ANDNOT op, e.g.

| static inline void __lse_atomic_and(int i, atomic_t *v)
| {
| 	return __lse_atomic_andnot(~i, v);
| }
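
The FETCH forms reduce the same way; as a rough sketch, the
fully-ordered variant generated by ATOMIC_FETCH_OP_AND below boils down
to:

| static inline int __lse_atomic_fetch_and(int i, atomic_t *v)
| {
| 	return __lse_atomic_fetch_andnot(~i, v);
| }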

This is intended as an optimization and cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 34 ++++-------------------------
 1 file changed, 4 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 7454febb6d77..d707eafb7677 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -102,26 +102,13 @@ ATOMIC_OP_ADD_SUB_RETURN(        , al, "memory")
 
 static inline void __lse_atomic_and(int i, atomic_t *v)
 {
-	asm volatile(
-	__LSE_PREAMBLE
-	"	mvn	%w[i], %w[i]\n"
-	"	stclr	%w[i], %[v]"
-	: [i] "+&r" (i), [v] "+Q" (v->counter)
-	: "r" (v));
+	return __lse_atomic_andnot(~i, v);
 }
 
 #define ATOMIC_FETCH_OP_AND(name, mb, cl...)				\
 static inline int __lse_atomic_fetch_and##name(int i, atomic_t *v)	\
 {									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	mvn	%w[i], %w[i]\n"					\
-	"	ldclr" #mb "	%w[i], %w[i], %[v]"			\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
+	return __lse_atomic_fetch_andnot##name(~i, v);			\
 }
 
 ATOMIC_FETCH_OP_AND(_relaxed,   )
@@ -223,26 +210,13 @@ ATOMIC64_OP_ADD_SUB_RETURN(        , al, "memory")
 
 static inline void __lse_atomic64_and(s64 i, atomic64_t *v)
 {
-	asm volatile(
-	__LSE_PREAMBLE
-	"	mvn	%[i], %[i]\n"
-	"	stclr	%[i], %[v]"
-	: [i] "+&r" (i), [v] "+Q" (v->counter)
-	: "r" (v));
+	return __lse_atomic64_andnot(~i, v);
 }
 
 #define ATOMIC64_FETCH_OP_AND(name, mb, cl...)				\
 static inline long __lse_atomic64_fetch_and##name(s64 i, atomic64_t *v)	\
 {									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	mvn	%[i], %[i]\n"					\
-	"	ldclr" #mb "	%[i], %[i], %[v]"			\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
+	return __lse_atomic64_fetch_andnot##name(~i, v);		\
 }
 
 ATOMIC64_FETCH_OP_AND(_relaxed,   )
-- 
2.30.2



* [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
                   ` (2 preceding siblings ...)
  2021-12-10 15:14 ` [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs Mark Rutland
@ 2021-12-10 15:14 ` Mark Rutland
  2021-12-13 19:40   ` Will Deacon
  2021-12-10 15:14 ` [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops Mark Rutland
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

We have overly conservative assembly constraints for the basic FEAT_LSE
atomic instructions, and using more accurate and permissive constraints
will allow for better code generation.

The FEAT_LSE basic atomic instructions come in two forms:

	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
	ST{op}{order}{size} <Rs>, [<Rn>]

The ST* forms are aliases of the LD* forms where:

	ST{op}{order}{size} <Rs>, [<Rn>]
Is:
	LD{op}{order}{size} <Rs>, XZR, [<Rn>]

For either form, both <Rs> and <Rn> are read but not written back to,
and <Rt> is written with the original value of the memory location.
Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
other register value(s) are consumed. There are no UNPREDICTABLE or
CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
<Rn> are the same register.

Our current inline assembly always uses <Rs> == <Rt>, treating this
register as both an input and an output (using a '+r' constraint). This
forces the compiler to do some unnecessary register shuffling and/or
redundant value generation.

For example, the compiler cannot reuse the <Rs> value, and currently GCC
11.1.0 will compile:

	__lse_atomic_add(1, a);
	__lse_atomic_add(1, b);
	__lse_atomic_add(1, c);

As:

	mov     w3, #0x1
	mov     w4, w3
	stadd   w4, [x0]
	mov     w0, w3
	stadd   w0, [x1]
	stadd   w3, [x2]

We can improve this with more accurate constraints, separating <Rs> and
<Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
output-only value ('=r'). As <Rt> is written back after <Rs> is
consumed, it does not need to be earlyclobber ('=&r'), leaving the
compiler free to use the same register for both <Rs> and <Rt> where this
is desirable.

At the same time, the redundant 'r' constraint for `v` is removed, as
the `+Q` constraint is sufficient.

With this change, the above example becomes:

	mov     w3, #0x1
	stadd   w3, [x0]
	stadd   w3, [x1]
	stadd   w3, [x2]

I've made this change for the non-value-returning and FETCH ops. The
RETURN ops have a multi-instruction sequence for which we cannot use the
same constraints, and a subsequent patch will rewrite the RETURN ops in
terms of the FETCH ops, relying on the ability of the compiler to reuse
the <Rs> value.
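
As a sketch of what the updated macros expand to (hand-expanded here
for illustration, so approximate rather than verbatim):

| static inline void __lse_atomic_add(int i, atomic_t *v)
| {
| 	asm volatile(
| 	__LSE_PREAMBLE
| 	"	stadd	%w[i], %[v]\n"
| 	: [v] "+Q" (v->counter)	/* the memory location is the only output */
| 	: [i] "r" (i));		/* <Rs> is input-only */
| }
| 
| static inline int __lse_atomic_fetch_add_relaxed(int i, atomic_t *v)
| {
| 	int old;
| 
| 	asm volatile(
| 	__LSE_PREAMBLE
| 	"	ldadd	%w[i], %w[old], %[v]"
| 	: [v] "+Q" (v->counter),
| 	  [old] "=r" (old)	/* <Rt> is output-only, not earlyclobber */
| 	: [i] "r" (i));
| 
| 	return old;
| }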

This is intended as an optimization.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index d707eafb7677..e4c5c4c34ce6 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -16,8 +16,8 @@ static inline void __lse_atomic_##op(int i, atomic_t *v)		\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%w[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC_OP(andnot, stclr)
@@ -35,14 +35,17 @@ static inline void __lse_atomic_sub(int i, atomic_t *v)
 #define ATOMIC_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline int __lse_atomic_fetch_##op##name(int i, atomic_t *v)	\
 {									\
+	int old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%w[i], %w[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%w[i], %w[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i)							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC_FETCH_OPS(op, asm_op)					\
@@ -124,8 +127,8 @@ static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)		\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC64_OP(andnot, stclr)
@@ -143,14 +146,17 @@ static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 #define ATOMIC64_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline long __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v)\
 {									\
+	s64 old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%[i], %[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%[i], %[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i) 							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC64_FETCH_OPS(op, asm_op)					\
-- 
2.30.2



* [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
                   ` (3 preceding siblings ...)
  2021-12-10 15:14 ` [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops Mark Rutland
@ 2021-12-10 15:14 ` Mark Rutland
  2021-12-10 22:19   ` Peter Zijlstra
  2021-12-13 19:43   ` Will Deacon
  2021-12-10 22:26 ` [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Peter Zijlstra
  2021-12-14 16:54 ` Catalin Marinas
  6 siblings, 2 replies; 16+ messages in thread
From: Mark Rutland @ 2021-12-10 15:14 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: boqun.feng, catalin.marinas, mark.rutland, peterz, will

The FEAT_LSE atomic instructions include LD* instructions which return
the original value of a memory location and can be used to directly
implement FETCH operations. Each RETURN op is implemented as a copy of
the corresponding FETCH op with a trailing instruction instruction to
generate the new value of the memory location. We only
directly implement *_fetch_add*(), for which we have a trailing `add`
instruction.

As the compiler has no visibility of the `add`, this leads to less than
optimal code generation when consuming the result.

For example, the compiler cannot constant-fold the addition into later
operations, and currently GCC 11.1.0 will compile:

       return __lse_atomic_sub_return(1, v) == 0;

As:

	mov     w1, #0xffffffff
	ldaddal w1, w2, [x0]
	add     w1, w1, w2
	cmp     w1, #0x0
	cset    w0, eq  // eq = none
	ret

This patch improves this by replacing the `add` with C addition after
the inline assembly block, e.g.

	ret += i;

This allows the compiler to manipulate `i`. This permits the compiler to
merge the `add` and `cmp` for the above, e.g.

	mov     w1, #0xffffffff
	ldaddal w1, w1, [x0]
	cmp     w1, #0x1
	cset    w0, eq  // eq = none
	ret

With this change the assembly for each RETURN op is identical to the
corresponding FETCH op (including barriers and clobbers) so I've removed
the inline assembly and rewritten each RETURN op in terms of the
corresponding FETCH op, e.g.

| static inline int __lse_atomic_add_return(int i, atomic_t *v)
| {
|       return __lse_atomic_fetch_add(i, v) + i;
| }
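
The SUB RETURN ops take the same shape; as a rough sketch, the
fully-ordered case boils down to:

| static inline int __lse_atomic_sub_return(int i, atomic_t *v)
| {
| 	return __lse_atomic_fetch_sub(i, v) - i;
| }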

The new construction does not adversely affect the common case, and
before and after this patch GCC 11.1.0 can compile:

	__lse_atomic_add_return(i, v)

As:

	ldaddal w0, w2, [x1]
	add     w0, w0, w2

... while having the freedom to do better elsewhere.

This is intended as an optimization and cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 48 +++++++++--------------------
 1 file changed, 14 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index e4c5c4c34ce6..d955ade5df7c 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -75,31 +75,21 @@ ATOMIC_FETCH_OP_SUB(        )
 
 #undef ATOMIC_FETCH_OP_SUB
 
-#define ATOMIC_OP_ADD_SUB_RETURN(name, mb, cl...)			\
+#define ATOMIC_OP_ADD_SUB_RETURN(name)					\
 static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
 {									\
-	u32 tmp;							\
-									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"			\
-	"	add	%w[i], %w[i], %w[tmp]"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
+	return __lse_atomic_fetch_add##name(i, v) + i;			\
 }									\
 									\
 static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
 {									\
-	return __lse_atomic_add_return##name(-i, v);			\
+	return __lse_atomic_fetch_sub(i, v) - i;			\
 }
 
-ATOMIC_OP_ADD_SUB_RETURN(_relaxed,   )
-ATOMIC_OP_ADD_SUB_RETURN(_acquire,  a, "memory")
-ATOMIC_OP_ADD_SUB_RETURN(_release,  l, "memory")
-ATOMIC_OP_ADD_SUB_RETURN(        , al, "memory")
+ATOMIC_OP_ADD_SUB_RETURN(_relaxed)
+ATOMIC_OP_ADD_SUB_RETURN(_acquire)
+ATOMIC_OP_ADD_SUB_RETURN(_release)
+ATOMIC_OP_ADD_SUB_RETURN(        )
 
 #undef ATOMIC_OP_ADD_SUB_RETURN
 
@@ -186,31 +176,21 @@ ATOMIC64_FETCH_OP_SUB(        )
 
 #undef ATOMIC64_FETCH_OP_SUB
 
-#define ATOMIC64_OP_ADD_SUB_RETURN(name, mb, cl...)			\
+#define ATOMIC64_OP_ADD_SUB_RETURN(name)				\
 static inline long __lse_atomic64_add_return##name(s64 i, atomic64_t *v)\
 {									\
-	unsigned long tmp;						\
-									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	ldadd" #mb "	%[i], %x[tmp], %[v]\n"			\
-	"	add	%[i], %[i], %x[tmp]"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
-	: "r" (v)							\
-	: cl);								\
-									\
-	return i;							\
+	return __lse_atomic64_fetch_add##name(i, v) + i;		\
 }									\
 									\
 static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)\
 {									\
-	return __lse_atomic64_add_return##name(-i, v);			\
+	return __lse_atomic64_fetch_sub##name(i, v) - i;		\
 }
 
-ATOMIC64_OP_ADD_SUB_RETURN(_relaxed,   )
-ATOMIC64_OP_ADD_SUB_RETURN(_acquire,  a, "memory")
-ATOMIC64_OP_ADD_SUB_RETURN(_release,  l, "memory")
-ATOMIC64_OP_ADD_SUB_RETURN(        , al, "memory")
+ATOMIC64_OP_ADD_SUB_RETURN(_relaxed)
+ATOMIC64_OP_ADD_SUB_RETURN(_acquire)
+ATOMIC64_OP_ADD_SUB_RETURN(_release)
+ATOMIC64_OP_ADD_SUB_RETURN(        )
 
 #undef ATOMIC64_OP_ADD_SUB_RETURN
 
-- 
2.30.2



* Re: [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops
  2021-12-10 15:14 ` [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops Mark Rutland
@ 2021-12-10 22:19   ` Peter Zijlstra
  2021-12-13 16:49     ` Mark Rutland
  2021-12-13 19:43   ` Will Deacon
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-12-10 22:19 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, will

On Fri, Dec 10, 2021 at 03:14:10PM +0000, Mark Rutland wrote:
> The FEAT_LSE atomic instructions include LD* instructions which return
> the original value of a memory location can be used to directly
> implement FETCH opertations. Each RETURN op is implemented as a copy of
> the corresponding FETCH op with a trailing instruction instruction to

instruction instruction instruction is probably even more better :-)

> generate the new value of the memory location. We only
> directly implement *_fetch_add*(), for which we have a trailing `add`
> instruction.


* Re: [PATCH 0/5] arm64: atomics: cleanups and codegen improvements
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
                   ` (4 preceding siblings ...)
  2021-12-10 15:14 ` [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops Mark Rutland
@ 2021-12-10 22:26 ` Peter Zijlstra
  2021-12-14 16:54 ` Catalin Marinas
  6 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2021-12-10 22:26 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, will

On Fri, Dec 10, 2021 at 03:14:05PM +0000, Mark Rutland wrote:
> Mark Rutland (5):
>   arm64: atomics: format whitespace consistently
>   arm64: atomics lse: define SUBs in terms of ADDs
>   arm64: atomics: lse: define ANDs in terms of ANDNOTs
>   arm64: atomics: lse: improve constraints for simple ops
>   arm64: atomics: lse: define RETURN ops in terms of FETCH ops
> 
>  arch/arm64/include/asm/atomic_ll_sc.h |  86 ++++----
>  arch/arm64/include/asm/atomic_lse.h   | 270 ++++++++------------------
>  2 files changed, 126 insertions(+), 230 deletions(-)

Very nice cleanups,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops
  2021-12-10 22:19   ` Peter Zijlstra
@ 2021-12-13 16:49     ` Mark Rutland
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Rutland @ 2021-12-13 16:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, will

On Fri, Dec 10, 2021 at 11:19:40PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 10, 2021 at 03:14:10PM +0000, Mark Rutland wrote:
> > The FEAT_LSE atomic instructions include LD* instructions which return
> > the original value of a memory location can be used to directly
> > implement FETCH opertations. Each RETURN op is implemented as a copy of
> > the corresponding FETCH op with a trailing instruction instruction to
> 
> instruction instruction instruction is probably even more better :-)

I've gone the other way and just made that a single `instruction` in the
branch. :)

Mark.


* Re: [PATCH 1/5] arm64: atomics: format whitespace consistently
  2021-12-10 15:14 ` [PATCH 1/5] arm64: atomics: format whitespace consistently Mark Rutland
@ 2021-12-13 19:20   ` Will Deacon
  0 siblings, 0 replies; 16+ messages in thread
From: Will Deacon @ 2021-12-13 19:20 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Fri, Dec 10, 2021 at 03:14:06PM +0000, Mark Rutland wrote:
> The code for the atomic ops is formatted inconsistently, and while this
> is not a functional problem it is rather distracting when working on
> them.
> 
> Some have ops have consistent indentation, e.g.
> 
> | #define ATOMIC_OP_ADD_RETURN(name, mb, cl...)                           \
> | static inline int __lse_atomic_add_return##name(int i, atomic_t *v)     \
> | {                                                                       \
> |         u32 tmp;                                                        \
> |                                                                         \
> |         asm volatile(                                                   \
> |         __LSE_PREAMBLE                                                  \
> |         "       ldadd" #mb "    %w[i], %w[tmp], %[v]\n"                 \
> |         "       add     %w[i], %w[i], %w[tmp]"                          \
> |         : [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)        \
> |         : "r" (v)                                                       \
> |         : cl);                                                          \
> |                                                                         \
> |         return i;                                                       \
> | }
> 
> While others have negative indentation for some lines, and/or have
> misaligned trailing backslashes, e.g.
> 
> | static inline void __lse_atomic_##op(int i, atomic_t *v)                        \
> | {                                                                       \
> |         asm volatile(                                                   \
> |         __LSE_PREAMBLE                                                  \
> | "       " #asm_op "     %w[i], %[v]\n"                                  \
> |         : [i] "+r" (i), [v] "+Q" (v->counter)                           \
> |         : "r" (v));                                                     \
> | }
> 
> This patch makes the indentation consistent and also aligns the trailing
> backslashes. This makes the code easier to read for those (like myself)
> who are easily distracted by these inconsistencies.
> 
> This is intended as a cleanup.
> There should be no functional change as a result of this patch.

Looks the same to me:

Acked-by: Will Deacon <will@kernel.org>

Will


* Re: [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs
  2021-12-10 15:14 ` [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs Mark Rutland
@ 2021-12-13 19:27   ` Will Deacon
  0 siblings, 0 replies; 16+ messages in thread
From: Will Deacon @ 2021-12-13 19:27 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Fri, Dec 10, 2021 at 03:14:07PM +0000, Mark Rutland wrote:
> The FEAT_LSE atomic instructions include atomic ADD instructions
> (`stadd*` and `ldadd*`), but do not include atomic SUB instructions, so
> we must build all of the SUB operations using the ADD instructions. We
> open-code these today, with each SUB op implemented as a copy of the
> corresponding ADD op with a leading `neg` instruction in the inline
> assembly to negate the `i` argument.
> 
> As the compiler has no visibility of the `neg`, this leads to less than
> optimal code generation when generating `i` into a register. For
> example, __les_atomic_fetch_sub(1, v) can be compiled to:
> 
> 	mov     w1, #0x1
> 	neg     w1, w1
> 	ldaddal w1, w1, [x2]
> 
> This patch improves this by replacing the `neg` with negation in C
> before the inline assembly block, e.g.
> 
> 	i = -i;
> 
> This allows the compiler to generate `i` into a register more optimally,
> e.g.
> 
> 	mov     w1, #0xffffffff
> 	ldaddal w1, w1, [x2]
> 
> With this change the assembly for each SUB op is identical to the
> corresponding ADD op (including barriers and clobbers), so I've removed
> the inline assembly and rewritten each SUB op in terms of the
> corresponding ADD op, e.g.
> 
> | static inline void __lse_atomic_sub(int i, atomic_t *v)
> | {
> | 	__lse_atomic_add(-i, v);
> | }
> 
> For clarity I've moved the definition of each SUB op immediately after
> the corresponding ADD op, and used a single macro to create the RETURN
> forms of both ops.
> 
> This is intended as an optimization and cleanup.
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 180 +++++++++-------------------
>  1 file changed, 58 insertions(+), 122 deletions(-)

Great diffstat!

Acked-by: Will Deacon <will@kernel.org>

Will


* Re: [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs
  2021-12-10 15:14 ` [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs Mark Rutland
@ 2021-12-13 19:29   ` Will Deacon
  0 siblings, 0 replies; 16+ messages in thread
From: Will Deacon @ 2021-12-13 19:29 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Fri, Dec 10, 2021 at 03:14:08PM +0000, Mark Rutland wrote:
> The FEAT_LSE atomic instructions include atomic bit-clear instructions
> (`ldclr*` and `stclr*`) which can be used to directly implement ANDNOT
> operations. Each AND op is implemented as a copy of the corresponding
> ANDNOT op with a leading `mvn` instruction to apply a bitwise NOT to the
> `i` argument.
> 
> As the compiler has no visibility of the `mvn`, this leads to less than
> optimal code generation when generating `i` into a register. For
> example, __lse_atomic_fetch_and(0xf, v) can be compiled to:
> 
> 	mov     w1, #0xf
> 	mvn     w1, w1
> 	ldclral w1, w1, [x2]
> 
> This patch improves this by replacing the `mvn` with NOT in C before the
> inline assembly block, e.g.
> 
> 	i = ~i;
> 
> This allows the compiler to generate `i` into a register more optimally,
> e.g.
> 
> 	mov     w1, #0xfffffff0
> 	ldclral w1, w1, [x2]
> 
> With this change the assembly for each AND op is identical to the
> corresponding ANDNOT op (including barriers and clobbers), so I've
> removed the inline assembly and rewritten each AND op in terms of the
> corresponding ANDNOT op, e.g.
> 
> | static inline void __lse_atomic_and(int i, atomic_t *v)
> | {
> | 	return __lse_atomic_andnot(~i, v);
> | }
> 
> This is intended as an optimization and cleanup.
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 34 ++++-------------------------
>  1 file changed, 4 insertions(+), 30 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will


* Re: [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops
  2021-12-10 15:14 ` [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops Mark Rutland
@ 2021-12-13 19:40   ` Will Deacon
  2021-12-14 13:04     ` Mark Rutland
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2021-12-13 19:40 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> We have overly conservative assembly constraints for the basic FEAT_LSE
> atomic instructions, and using more accurate and permissive constraints
> will allow for better code generation.
> 
> The FEAT_LSE basic atomic instructions have come in two forms:
> 
> 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> 
> The ST* forms are aliases of the LD* forms where:
> 
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> Is:
> 	LD{op}{order}{size} <Rs>, XZR, [<Rn>]
> 
> For either form, both <Rs> and <Rn> are read but not written back to,
> and <Rt> is written with the original value of the memory location.
> Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> other register value(s) are consumed. There are no UNPREDICTABLE or
> CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> <Rn> are the same register.
> 
> Our current inline assembly always uses <Rs> == <Rt>, treating this
> register as both an input and an output (using a '+r' constraint). This
> forces the compiler to do some unnecessary register shuffling and/or
> redundant value generation.
> 
> For example, the compiler cannot reuse the <Rs> value, and currently GCC
> 11.1.0 will compile:
> 
> 	__lse_atomic_add(1, a);
> 	__lse_atomic_add(1, b);
> 	__lse_atomic_add(1, c);
> 
> As:
> 
> 	mov     w3, #0x1
> 	mov     w4, w3
> 	stadd   w4, [x0]
> 	mov     w0, w3
> 	stadd   w0, [x1]
> 	stadd   w3, [x2]
> 
> We can improve this with more accurate constraints, separating <Rs> and
> <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> output-only value ('=r'). As <Rt> is written back after <Rs> is
> consumed, it does not need to be earlyclobber ('=&r'), leaving the
> compiler free to use the same register for both <Rs> and <Rt> where this
> is desirable.
> 
> At the same time, the redundant 'r' constraint for `v` is removed, as
> the `+Q` constraint is sufficient.
> 
> With this change, the above example becomes:
> 
> 	mov     w3, #0x1
> 	stadd   w3, [x0]
> 	stadd   w3, [x1]
> 	stadd   w3, [x2]
> 
> I've made this change for the non-value-returning and FETCH ops. The
> RETURN ops have a multi-instruction sequence for which we cannot use the
> same constraints, and a subsequent patch will rewrite the RETURN ops in
> terms of the FETCH ops, relying on the ability for the compiler to reuse
> the <Rs> value.
> 
> This is intended as an optimization.
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Makes sense to me. I'm not sure _why_ the current constraints are so weird;
maybe a hangover from when we patched them inline? Anywho:

Acked-by: Will Deacon <will@kernel.org>

Will
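
For readers following along, a rough sketch of the constraint shape described
above, using hypothetical function names (the real macros in
arch/arm64/include/asm/atomic_lse.h are structured differently); build for
AArch64 with LSE available, e.g. -march=armv8.1-a:

| /* FETCH op: <Rs> is input-only ("r"), <Rt> is output-only ("=r", not
|  * earlyclobber), and the memory location is covered by "+Q" alone. */
| static inline int example_fetch_add(int i, int *counter)
| {
| 	int old;
|
| 	asm volatile(
| 	"	ldaddal	%w[i], %w[old], %[v]"
| 	: [old] "=r" (old), [v] "+Q" (*counter)
| 	: [i] "r" (i)
| 	: "memory");
|
| 	return old;
| }
|
| /* Non-value-returning op: the ST* alias needs only <Rs> and the memory. */
| static inline void example_add(int i, int *counter)
| {
| 	asm volatile(
| 	"	stadd	%w[i], %[v]"
| 	: [v] "+Q" (*counter)
| 	: [i] "r" (i));
| }

Since <Rt> is not earlyclobber, the compiler remains free to allocate the same
register for 'i' and 'old' when that is convenient.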

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops
  2021-12-10 15:14 ` [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops Mark Rutland
  2021-12-10 22:19   ` Peter Zijlstra
@ 2021-12-13 19:43   ` Will Deacon
  1 sibling, 0 replies; 16+ messages in thread
From: Will Deacon @ 2021-12-13 19:43 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Fri, Dec 10, 2021 at 03:14:10PM +0000, Mark Rutland wrote:
> The FEAT_LSE atomic instructions include LD* instructions which return
> the original value of a memory location, and these can be used to
> directly implement FETCH operations. Each RETURN op is implemented as a
> copy of the corresponding FETCH op with a trailing instruction to
> generate the new value of the memory location. We only directly
> implement *_fetch_add*(), for which we have a trailing `add`
> instruction.
> 
> As the compiler has no visibility of the `add`, this leads to less than
> optimal code generation when consuming the result.
> 
> For example, the compiler cannot constant-fold the addition into later
> operations, and currently GCC 11.1.0 will compile:
> 
>        return __lse_atomic_sub_return(1, v) == 0;
> 
> As:
> 
> 	mov     w1, #0xffffffff
> 	ldaddal w1, w2, [x0]
> 	add     w1, w1, w2
> 	cmp     w1, #0x0
> 	cset    w0, eq  // eq = none
> 	ret
> 
> This patch improves matters by replacing the `add` with C addition
> after the inline assembly block, e.g.
> 
> 	ret += i;
> 
> This allows the compiler to manipulate `i`, and in turn to merge the
> `add` and `cmp` for the above, e.g.
> 
> 	mov     w1, #0xffffffff
> 	ldaddal w1, w1, [x0]
> 	cmp     w1, #0x1
> 	cset    w0, eq  // eq = none
> 	ret
> 
> With this change the assembly for each RETURN op is identical to the
> corresponding FETCH op (including barriers and clobbers) so I've removed
> the inline assembly and rewritten each RETURN op in terms of the
> corresponding FETCH op, e.g.
> 
> | static inline int __lse_atomic_add_return(int i, atomic_t *v)
> | {
> |       return __lse_atomic_fetch_add(i, v) + i;
> | }
> 
> The new construction does not adversely affect the common case, and
> before and after this patch GCC 11.1.0 can compile:
> 
> 	__lse_atomic_add_return(i, v)
> 
> As:
> 
> 	ldaddal w0, w2, [x1]
> 	add     w0, w0, w2
> 
> ... while having the freedom to do better elsewhere.
> 
> This is intended as an optimization and cleanup.
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 48 +++++++++--------------------
>  1 file changed, 14 insertions(+), 34 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will
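
To make the shape of the change concrete, a minimal sketch of the pattern with
the FETCH primitive modelled by a compiler builtin rather than the LSE asm
(illustrative names only):

| /* Stand-in FETCH primitive: returns the value *before* the update. */
| static inline int model_fetch_add(int i, int *counter)
| {
| 	return __atomic_fetch_add(counter, i, __ATOMIC_SEQ_CST);
| }
|
| /* RETURN ops written in terms of the FETCH op: the trailing arithmetic is
|  * ordinary C, so the compiler can fold it into later uses of the result
|  * (e.g. a comparison against zero). */
| static inline int model_add_return(int i, int *counter)
| {
| 	return model_fetch_add(i, counter) + i;
| }
|
| static inline int model_sub_return(int i, int *counter)
| {
| 	return model_fetch_add(-i, counter) - i;
| }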

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops
  2021-12-13 19:40   ` Will Deacon
@ 2021-12-14 13:04     ` Mark Rutland
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Rutland @ 2021-12-14 13:04 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-arm-kernel, boqun.feng, catalin.marinas, peterz

On Mon, Dec 13, 2021 at 07:40:23PM +0000, Will Deacon wrote:
> On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> > We have overly conservative assembly constraints for the basic FEAT_LSE
> > atomic instructions, and using more accurate and permissive constraints
> > will allow for better code generation.
> > 
> > The FEAT_LSE basic atomic instructions come in two forms:
> > 
> > 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> > 	ST{op}{order}{size} <Rs>, [<Rn>]

> > For either form, both <Rs> and <Rn> are read but not written back to,
> > and <Rt> is written with the original value of the memory location.
> > Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> > other register value(s) are consumed. There are no UNPREDICTABLE or
> > CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> > <Rn> are the same register.

> > Our current inline assembly always uses <Rs> == <Rt>, treating this
> > register as both an input and an output (using a '+r' constraint). This
> > forces the compiler to do some unnecessary register shuffling and/or
> > redundant value generation.

> > We can improve this with more accurate constraints, separating <Rs> and
> > <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> > output-only value ('=r'). As <Rt> is written back after <Rs> is
> > consumed, it does not need to be earlyclobber ('=&r'), leaving the
> > compiler free to use the same register for both <Rs> and <Rt> where this
> > is desirable.
> > 
> > At the same time, the redundant 'r' constraint for `v` is removed, as
> > the `+Q` constraint is sufficient.

> Makes sense to me. I'm not sure _why_ the current constraints are so weird;
> maybe a hangover from when we patched them inline?

Yup; the constraints used to make sense when we were patching the instructions,
and we just never cleaned them up. Details below -- I can add a link tag to
this mail if you'd like?

For example, back when we introduced the ops in commit:

  c0385b24af15020a ("arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics")

We had:

| #define ATOMIC_OP(op, asm_op)                                          \
| static inline void atomic_##op(int i, atomic_t *v)                     \
| {                                                                      \
|        register int w0 asm ("w0") = i;                                 \
|        register atomic_t *x1 asm ("x1") = v;                           \
|                                                                        \
|        asm volatile(                                                   \
|        __LL_SC_CALL(op)                                                \
|        : "+r" (w0), "+Q" (v->counter)                                  \
|        : "r" (x1)                                                      \
|        : "x30");                                                       \
| }                                                                      \

IIUC, for those constraints:

* "+r" (w0) -- necessary because the LL-SC forms might clobber w0 (e.g. if
  used for their `tmp` variable). For the value-returning ops we had to use
  w0/x0 for the return value as the out-of-line LL-SC ops followed the AAPCS
  parameter passing rules.

* "+Q" (v->counter) -- necessary to clobber the memory location, "Q"
  specifically to ensure the compiler didn't expect some write-back addressing
  mode to have adjusted a register.

* "r" (x1) -- necessary to ensure the address was available in X1 for the LL-SC
  code, since I believe the "+Q" constraint *could* have used a copy in another
  register. I could be mistaken on this one, since there's no rationale in the
  original commit message or code.

When that approach was thrown away in commits

  addfc38672c73efd ("arm64: atomics: avoid out-of-line ll/sc atomics")
  3337cb5aea594e40 ("arm64: avoid using hard-coded registers for LSE atomics")

... we stopped hard-coding the registers, and added a `tmp` register for ops
with a return value, but otherwise left the constraints as-is.

> Anywho:
> 
> Acked-by: Will Deacon <will@kernel.org>

Thanks!

Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5] arm64: atomics: cleanups and codegen improvements
  2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
                   ` (5 preceding siblings ...)
  2021-12-10 22:26 ` [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Peter Zijlstra
@ 2021-12-14 16:54 ` Catalin Marinas
  6 siblings, 0 replies; 16+ messages in thread
From: Catalin Marinas @ 2021-12-14 16:54 UTC (permalink / raw)
  To: linux-arm-kernel, Mark Rutland; +Cc: Will Deacon, peterz, boqun.feng

On Fri, 10 Dec 2021 15:14:05 +0000, Mark Rutland wrote:
> While looking at Peter's recent refcount rework, I spotted that we have
> some unfortunate code generation for the LSE atomics. Due to a
> combination of assembly constraints and manipulation performed in
> assembly which the compiler has no visibility of, the compiler ends up
> generating unnecessary register shuffling and redundant manipulation.
> 
> This series (based on v5.16-rc4) attempts to improve this by improving
> the constraints, and moving value manipulation to C where the compiler
> can perform a number of optimizations. This also has the benefit of
> simplifying the implementation and deleting 100+ lines of code.
> 
> [...]

Applied to arm64 (for-next/atomics), thanks! (also fixed the double
"instruction" in the last patch).

[1/5] arm64: atomics: format whitespace consistently
      https://git.kernel.org/arm64/c/8e6082e94aac
[2/5] arm64: atomics lse: define SUBs in terms of ADDs
      https://git.kernel.org/arm64/c/ef5324506098
[3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs
      https://git.kernel.org/arm64/c/5e9e43c987b2
[4/5] arm64: atomics: lse: improve constraints for simple ops
      https://git.kernel.org/arm64/c/8a578a759ad6
[5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops
      https://git.kernel.org/arm64/c/053f58bab331

-- 
Catalin


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-12-14 16:56 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-10 15:14 [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Mark Rutland
2021-12-10 15:14 ` [PATCH 1/5] arm64: atomics: format whitespace consistently Mark Rutland
2021-12-13 19:20   ` Will Deacon
2021-12-10 15:14 ` [PATCH 2/5] arm64: atomics lse: define SUBs in terms of ADDs Mark Rutland
2021-12-13 19:27   ` Will Deacon
2021-12-10 15:14 ` [PATCH 3/5] arm64: atomics: lse: define ANDs in terms of ANDNOTs Mark Rutland
2021-12-13 19:29   ` Will Deacon
2021-12-10 15:14 ` [PATCH 4/5] arm64: atomics: lse: improve constraints for simple ops Mark Rutland
2021-12-13 19:40   ` Will Deacon
2021-12-14 13:04     ` Mark Rutland
2021-12-10 15:14 ` [PATCH 5/5] arm64: atomics: lse: define RETURN ops in terms of FETCH ops Mark Rutland
2021-12-10 22:19   ` Peter Zijlstra
2021-12-13 16:49     ` Mark Rutland
2021-12-13 19:43   ` Will Deacon
2021-12-10 22:26 ` [PATCH 0/5] arm64: atomics: cleanups and codegen improvements Peter Zijlstra
2021-12-14 16:54 ` Catalin Marinas
