linux-doc.vger.kernel.org archive mirror
* [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
@ 2023-02-02 14:50 Peter Zijlstra
  2023-02-02 14:50 ` [PATCH v2 01/10] crypto/b128ops: Remove struct u128 Peter Zijlstra
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Hi!

Since Linus hated on cmpxchg_double(), here are a few patches to get rid of
it, as proposed here:

  https://lkml.kernel.org/r/Y2U3WdU61FvYlpUh@hirez.programming.kicks-ass.net


These patches are based on 6.2.0-rc6 + cryptodev-2.6, but also apply to next/master.

Available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git core/wip-u128

New since v1:

 - rebased on Eric's ghash cleanups (hence the cryptodev-2.6 dependency)
 - rebased on Heiko's s390/cpum_sf CDSG patch
 - fixed up a bunch of arch code
 - fixed up the inline asm to use 'u128 *' mem argument so the compiler knows
   how wide the modification is.
 - reworked the percpu code to use union-based type-punning instead of
   _Generic() based casts (a simplified sketch of the idea follows below).
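
Roughly, the union based type-punning looks like this -- a simplified
sketch of the 16-byte case only, not the exact patch text (the real
macros are in the percpu patch below):

  #define this_cpu_cmpxchg_16(pcp, o, n)                              \
  ({                                                                  \
          union {                                                     \
                  typeof(pcp) pot;                                    \
                  u128 val;                                           \
          } old__ = { .pot = (o) }, new__ = { .pot = (n) }, ret__;    \
                                                                      \
          preempt_disable_notrace();                                  \
          ret__.val = cmpxchg128_local((void *)raw_cpu_ptr(&(pcp)),   \
                                       old__.val, new__.val);         \
          preempt_enable_notrace();                                   \
          ret__.pot;                                                  \
  })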

---
 Documentation/core-api/this_cpu_ops.rst     |   2 -
 arch/arm64/include/asm/atomic_ll_sc.h       |  56 ++++++-----
 arch/arm64/include/asm/atomic_lse.h         |  39 ++++----
 arch/arm64/include/asm/cmpxchg.h            |  48 +++-------
 arch/arm64/include/asm/percpu.h             |  31 ++++--
 arch/s390/include/asm/cmpxchg.h             |  32 ++-----
 arch/s390/include/asm/cpu_mf.h              |   2 +-
 arch/s390/include/asm/percpu.h              |  35 ++++---
 arch/s390/kernel/perf_cpum_sf.c             |  22 ++---
 arch/x86/include/asm/cmpxchg.h              |  25 -----
 arch/x86/include/asm/cmpxchg_32.h           |   2 +-
 arch/x86/include/asm/cmpxchg_64.h           |  54 ++++++++++-
 arch/x86/include/asm/percpu.h               |  97 +++++++++++--------
 drivers/iommu/amd/amd_iommu_types.h         |   9 +-
 drivers/iommu/amd/iommu.c                   |  10 +-
 drivers/iommu/intel/irq_remapping.c         |   8 +-
 include/asm-generic/percpu.h                |  62 ++----------
 include/crypto/b128ops.h                    |  14 +--
 include/linux/atomic/atomic-arch-fallback.h |  95 ++++++++++++++++++-
 include/linux/atomic/atomic-instrumented.h  |  88 ++++++++++++++---
 include/linux/dmar.h                        | 125 ++++++++++++------------
 include/linux/percpu-defs.h                 |  46 +++------
 include/linux/slub_def.h                    |  12 ++-
 include/linux/types.h                       |   5 +
 include/uapi/linux/types.h                  |   4 +
 lib/crypto/curve25519-hacl64.c              |   2 -
 lib/crypto/poly1305-donna64.c               |   2 -
 mm/slab.h                                   |  45 ++++++++-
 mm/slub.c                                   | 142 +++++++++++++++++-----------
 scripts/atomic/gen-atomic-fallback.sh       |   4 +-
 scripts/atomic/gen-atomic-instrumented.sh   |  21 ++--
 31 files changed, 645 insertions(+), 494 deletions(-)




* [PATCH v2 01/10] crypto/b128ops: Remove struct u128
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-02 20:21   ` H. Peter Anvin
  2023-02-03  1:06   ` Herbert Xu
  2023-02-02 14:50 ` [PATCH v2 02/10] types: Introduce [us]128 Peter Zijlstra
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Per git-grep, u128_xor() and its related struct u128 are unused except
to implement {be,le}128_xor(). Remove them to free up the namespace.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
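Not part of the patch -- callers of the surviving helpers are
unaffected; e.g. a purely hypothetical user:

  static void gf128_mix(be128 *t, const be128 *a, const be128 *b)
  {
          /* per 64-bit-half XOR, same behaviour as before this patch */
          be128_xor(t, a, b);
  }
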
 include/crypto/b128ops.h |   14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

--- a/include/crypto/b128ops.h
+++ b/include/crypto/b128ops.h
@@ -50,10 +50,6 @@
 #include <linux/types.h>
 
 typedef struct {
-	u64 a, b;
-} u128;
-
-typedef struct {
 	__be64 a, b;
 } be128;
 
@@ -61,20 +57,16 @@ typedef struct {
 	__le64 b, a;
 } le128;
 
-static inline void u128_xor(u128 *r, const u128 *p, const u128 *q)
+static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
 {
 	r->a = p->a ^ q->a;
 	r->b = p->b ^ q->b;
 }
 
-static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
-{
-	u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
-}
-
 static inline void le128_xor(le128 *r, const le128 *p, const le128 *q)
 {
-	u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
+	r->a = p->a ^ q->a;
+	r->b = p->b ^ q->b;
 }
 
 #endif /* _CRYPTO_B128OPS_H */




* [PATCH v2 02/10] types: Introduce [us]128
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
  2023-02-02 14:50 ` [PATCH v2 01/10] crypto/b128ops: Remove struct u128 Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-03  3:37   ` Herbert Xu
  2023-02-02 14:50 ` [PATCH v2 03/10] arch: Introduce arch_{,try_}cmpxchg128{,_local}() Peter Zijlstra
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Introduce [us]128 (when available). Unlike [us]64, ensure they are
always naturally aligned.

This also enables 128-bit wide atomics (which require natural
alignment), such as cmpxchg128().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
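Not part of the patch -- a quick illustration of what the new types
provide, assuming the compiler defines __SIZEOF_INT128__:

  #include <linux/build_bug.h>
  #include <linux/types.h>

  /* full 16 bytes and naturally aligned, hence usable with cmpxchg128() */
  static_assert(sizeof(u128) == 16);
  static_assert(__alignof__(u128) == 16);

  static inline u128 make_u128(u64 hi, u64 lo)
  {
          return ((u128)hi << 64) | lo;
  }
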
 include/linux/types.h      |    5 +++++
 include/uapi/linux/types.h |    4 ++++
 2 files changed, 9 insertions(+)

--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -10,6 +10,11 @@
 #define DECLARE_BITMAP(name,bits) \
 	unsigned long name[BITS_TO_LONGS(bits)]
 
+#ifdef __SIZEOF_INT128__
+typedef __s128 s128;
+typedef __u128 u128;
+#endif
+
 typedef u32 __kernel_dev_t;
 
 typedef __kernel_fd_set		fd_set;
--- a/include/uapi/linux/types.h
+++ b/include/uapi/linux/types.h
@@ -13,6 +13,10 @@
 
 #include <linux/posix_types.h>
 
+#ifdef __SIZEOF_INT128__
+typedef __signed__ __int128 __s128 __attribute__((aligned(16)));
+typedef unsigned __int128 __u128 __attribute__((aligned(16)));
+#endif
 
 /*
  * Below are truly Linux-specific types that should never collide with
--- a/lib/crypto/curve25519-hacl64.c
+++ b/lib/crypto/curve25519-hacl64.c
@@ -14,8 +14,6 @@
 #include <crypto/curve25519.h>
 #include <linux/string.h>
 
-typedef __uint128_t u128;
-
 static __always_inline u64 u64_eq_mask(u64 a, u64 b)
 {
 	u64 x = a ^ b;
--- a/lib/crypto/poly1305-donna64.c
+++ b/lib/crypto/poly1305-donna64.c
@@ -10,8 +10,6 @@
 #include <asm/unaligned.h>
 #include <crypto/internal/poly1305.h>
 
-typedef __uint128_t u128;
-
 void poly1305_core_setkey(struct poly1305_core_key *key,
 			  const u8 raw_key[POLY1305_BLOCK_SIZE])
 {




* [PATCH v2 03/10] arch: Introduce arch_{,try_}cmpxchg128{,_local}()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
  2023-02-02 14:50 ` [PATCH v2 01/10] crypto/b128ops: Remove struct u128 Peter Zijlstra
  2023-02-02 14:50 ` [PATCH v2 02/10] types: Introduce [us]128 Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-02 17:04   ` Heiko Carstens
  2023-02-03 16:52   ` Mark Rutland
  2023-02-02 14:50 ` [PATCH v2 04/10] instrumentation: Wire up cmpxchg128() Peter Zijlstra
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

For all architectures that currently support cmpxchg_double(),
implement the cmpxchg128() family of functions, which is basically the
same but with a saner interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
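Not part of the patch -- an illustrative fragment showing the intended
interface; update_desc() and desc are made-up names, and desc is
assumed to point at a naturally aligned u128:

  static bool update_desc(u128 *desc, u128 *old, u128 new)
  {
          /* true iff *desc was *old and is now new; *old is updated on failure */
          return arch_try_cmpxchg128(desc, old, new);
  }
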
 arch/arm64/include/asm/atomic_ll_sc.h |   41 +++++++++++++++++++++++++
 arch/arm64/include/asm/atomic_lse.h   |   31 +++++++++++++++++++
 arch/arm64/include/asm/cmpxchg.h      |   26 ++++++++++++++++
 arch/s390/include/asm/cmpxchg.h       |   14 ++++++++
 arch/x86/include/asm/cmpxchg_32.h     |    3 +
 arch/x86/include/asm/cmpxchg_64.h     |   55 +++++++++++++++++++++++++++++++++-
 6 files changed, 168 insertions(+), 2 deletions(-)

--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -326,6 +326,47 @@ __CMPXCHG_DBL(   ,        ,  ,         )
 __CMPXCHG_DBL(_mb, dmb ish, l, "memory")
 
 #undef __CMPXCHG_DBL
+
+union __u128_halves {
+	u128 full;
+	struct {
+		u64 low, high;
+	};
+};
+
+#define __CMPXCHG128(name, mb, rel, cl...)                             \
+static __always_inline u128						\
+__ll_sc__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)	\
+{									\
+	union __u128_halves r, o = { .full = (old) },			\
+			       n = { .full = (new) };			\
+       unsigned int tmp;                                               \
+									\
+	asm volatile("// __cmpxchg128" #name "\n"			\
+       "       prfm    pstl1strm, %[v]\n"                              \
+       "1:     ldxp    %[rl], %[rh], %[v]\n"                           \
+       "       cmp     %[rl], %[ol]\n"                                 \
+       "       ccmp    %[rh], %[oh], 0, eq\n"                          \
+       "       b.ne    2f\n"                                           \
+       "       st" #rel "xp    %w[tmp], %[nl], %[nh], %[v]\n"          \
+       "       cbnz    %w[tmp], 1b\n"                                  \
+	"	" #mb "\n"						\
+	"2:"								\
+       : [v] "+Q" (*(u128 *)ptr),                                      \
+         [rl] "=&r" (r.low), [rh] "=&r" (r.high),                      \
+         [tmp] "=&r" (tmp)                                             \
+       : [ol] "r" (o.low), [oh] "r" (o.high),                          \
+         [nl] "r" (n.low), [nh] "r" (n.high)                           \
+       : "cc", ##cl);                                                  \
+									\
+	return r.full;							\
+}
+
+__CMPXCHG128(   ,        ,  )
+__CMPXCHG128(_mb, dmb ish, l, "memory")
+
+#undef __CMPXCHG128
+
 #undef K
 
 #endif	/* __ASM_ATOMIC_LL_SC_H */
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -324,4 +324,35 @@ __CMPXCHG_DBL(_mb, al, "memory")
 
 #undef __CMPXCHG_DBL
 
+#define __CMPXCHG128(name, mb, cl...)					\
+static __always_inline u128						\
+__lse__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)		\
+{									\
+	union __u128_halves r, o = { .full = (old) },			\
+			       n = { .full = (new) };			\
+	register unsigned long x0 asm ("x0") = o.low;			\
+	register unsigned long x1 asm ("x1") = o.high;			\
+	register unsigned long x2 asm ("x2") = n.low;			\
+	register unsigned long x3 asm ("x3") = n.high;			\
+	register unsigned long x4 asm ("x4") = (unsigned long)ptr;	\
+									\
+	asm volatile(							\
+	__LSE_PREAMBLE							\
+	"	casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
+	: [old1] "+&r" (x0), [old2] "+&r" (x1),				\
+	  [v] "+Q" (*(u128 *)ptr)					\
+	: [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),		\
+	  [oldval1] "r" (o.low), [oldval2] "r" (o.high)			\
+	: cl);								\
+									\
+	r.low = x0; r.high = x1;					\
+									\
+	return r.full;							\
+}
+
+__CMPXCHG128(   ,   )
+__CMPXCHG128(_mb, al, "memory")
+
+#undef __CMPXCHG128
+
 #endif	/* __ASM_ATOMIC_LSE_H */
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -147,6 +147,19 @@ __CMPXCHG_DBL(_mb)
 
 #undef __CMPXCHG_DBL
 
+#define __CMPXCHG128(name)						\
+static inline u128 __cmpxchg128##name(volatile u128 *ptr,		\
+				      u128 old, u128 new)		\
+{									\
+	return __lse_ll_sc_body(_cmpxchg128##name,			\
+				ptr, old, new);				\
+}
+
+__CMPXCHG128(   )
+__CMPXCHG128(_mb)
+
+#undef __CMPXCHG128
+
 #define __CMPXCHG_GEN(sfx)						\
 static __always_inline unsigned long __cmpxchg##sfx(volatile void *ptr,	\
 					   unsigned long old,		\
@@ -229,6 +242,19 @@ __CMPXCHG_GEN(_mb)
 	__ret;									\
 })
 
+/* cmpxchg128 */
+#define system_has_cmpxchg128()		1
+
+#define arch_cmpxchg128(ptr, o, n)						\
+({										\
+	__cmpxchg128_mb((ptr), (o), (n));					\
+})
+
+#define arch_cmpxchg128_local(ptr, o, n)					\
+({										\
+	__cmpxchg128((ptr), (o), (n));						\
+})
+
 #define __CMPWAIT_CASE(w, sfx, sz)					\
 static inline void __cmpwait_case_##sz(volatile void *ptr,		\
 				       unsigned long val)		\
--- a/arch/s390/include/asm/cmpxchg.h
+++ b/arch/s390/include/asm/cmpxchg.h
@@ -201,4 +201,18 @@ static __always_inline int __cmpxchg_dou
 			 (unsigned long)(n1), (unsigned long)(n2));	\
 })
 
+#define system_has_cmpxchg128()		1
+
+static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
+{
+	asm volatile(
+		"	cdsg	%[old],%[new],%[ptr]\n"
+		: [old] "+d" (old), [ptr] "+QS" (*ptr)
+		: [new] "d" (new)
+		: "memory", "cc");
+	return old;
+}
+
+#define arch_cmpxchg128		arch_cmpxchg128
+
 #endif /* __ASM_CMPXCHG_H */
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -103,6 +103,7 @@ static inline bool __try_cmpxchg64(volat
 
 #endif
 
-#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX8)
+#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX8)
+#define system_has_cmpxchg64()		boot_cpu_has(X86_FEATURE_CX8)
 
 #endif /* _ASM_X86_CMPXCHG_32_H */
--- a/arch/x86/include/asm/cmpxchg_64.h
+++ b/arch/x86/include/asm/cmpxchg_64.h
@@ -20,6 +20,59 @@
 	arch_try_cmpxchg((ptr), (po), (n));				\
 })
 
-#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX16)
+union __u128_halves {
+	u128 full;
+	struct {
+		u64 low, high;
+	};
+};
+
+static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
+{
+	union __u128_halves o = { .full = old, }, n = { .full = new, };
+
+	asm volatile(LOCK_PREFIX "cmpxchg16b %[ptr]"
+		     : [ptr] "+m" (*ptr),
+		       "+a" (o.low), "+d" (o.high)
+		     : "b" (n.low), "c" (n.high)
+		     : "memory");
+
+	return o.full;
+}
+
+static __always_inline u128 arch_cmpxchg128_local(volatile u128 *ptr, u128 old, u128 new)
+{
+	union __u128_halves o = { .full = old, }, n = { .full = new, };
+
+	asm volatile("cmpxchg16b %[ptr]"
+		     : [ptr] "+m" (*ptr),
+		       "+a" (o.low), "+d" (o.high)
+		     : "b" (n.low), "c" (n.high)
+		     : "memory");
+
+	return o.full;
+}
+
+static __always_inline bool arch_try_cmpxchg128(volatile u128 *ptr, u128 *old, u128 new)
+{
+	union __u128_halves o = { .full = *old, }, n = { .full = new, };
+	bool ret;
+
+	asm volatile(LOCK_PREFIX "cmpxchg16b %[ptr]"
+		     CC_SET(e)
+		     : CC_OUT(e) (ret),
+		       [ptr] "+m" (*ptr),
+		       "+a" (o.low), "+d" (o.high)
+		     : "b" (n.low), "c" (n.high)
+		     : "memory");
+
+	if (unlikely(!ret))
+		*old = o.full;
+
+	return likely(ret);
+}
+
+#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX16)
+#define system_has_cmpxchg128()		boot_cpu_has(X86_FEATURE_CX16)
 
 #endif /* _ASM_X86_CMPXCHG_64_H */




* [PATCH v2 04/10] instrumentation: Wire up cmpxchg128()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 03/10] arch: Introduce arch_{,try_}cmpxchg128{,_local}() Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-03 16:55   ` Mark Rutland
  2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Wire up the cmpxchg128 family in the atomic wrappery scripts.

These provide the generic cmpxchg128 family of functions from the
arch_ prefixed version, adding explicit instrumentation where needed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
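Not part of the patch -- a hedged sketch of how generic code can use
the instrumented wrappers once they exist; update_entry() and compute()
are made-up names:

  #include <linux/atomic.h>

  static void update_entry(u128 *p, u128 (*compute)(u128))
  {
          u128 old = *p, new;

          do {
                  new = compute(old);
                  /* on failure, try_cmpxchg128() refreshes 'old' for us */
          } while (!try_cmpxchg128(p, &old, new));
  }
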
 include/linux/atomic/atomic-arch-fallback.h |   95 +++++++++++++++++++++++++++-
 include/linux/atomic/atomic-instrumented.h  |   77 ++++++++++++++++++++++
 scripts/atomic/gen-atomic-fallback.sh       |    4 -
 scripts/atomic/gen-atomic-instrumented.sh   |    4 -
 4 files changed, 174 insertions(+), 6 deletions(-)

--- a/include/linux/atomic/atomic-arch-fallback.h
+++ b/include/linux/atomic/atomic-arch-fallback.h
@@ -77,6 +77,29 @@
 
 #endif /* arch_cmpxchg64_relaxed */
 
+#ifndef arch_cmpxchg128_relaxed
+#define arch_cmpxchg128_acquire arch_cmpxchg128
+#define arch_cmpxchg128_release arch_cmpxchg128
+#define arch_cmpxchg128_relaxed arch_cmpxchg128
+#else /* arch_cmpxchg128_relaxed */
+
+#ifndef arch_cmpxchg128_acquire
+#define arch_cmpxchg128_acquire(...) \
+	__atomic_op_acquire(arch_cmpxchg128, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg128_release
+#define arch_cmpxchg128_release(...) \
+	__atomic_op_release(arch_cmpxchg128, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg128
+#define arch_cmpxchg128(...) \
+	__atomic_op_fence(arch_cmpxchg128, __VA_ARGS__)
+#endif
+
+#endif /* arch_cmpxchg128_relaxed */
+
 #ifndef arch_try_cmpxchg_relaxed
 #ifdef arch_try_cmpxchg
 #define arch_try_cmpxchg_acquire arch_try_cmpxchg
@@ -217,6 +240,76 @@
 
 #endif /* arch_try_cmpxchg64_relaxed */
 
+#ifndef arch_try_cmpxchg128_relaxed
+#ifdef arch_try_cmpxchg128
+#define arch_try_cmpxchg128_acquire arch_try_cmpxchg128
+#define arch_try_cmpxchg128_release arch_try_cmpxchg128
+#define arch_try_cmpxchg128_relaxed arch_try_cmpxchg128
+#endif /* arch_try_cmpxchg128 */
+
+#ifndef arch_try_cmpxchg128
+#define arch_try_cmpxchg128(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg128((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg128 */
+
+#ifndef arch_try_cmpxchg128_acquire
+#define arch_try_cmpxchg128_acquire(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg128_acquire((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg128_acquire */
+
+#ifndef arch_try_cmpxchg128_release
+#define arch_try_cmpxchg128_release(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg128_release((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg128_release */
+
+#ifndef arch_try_cmpxchg128_relaxed
+#define arch_try_cmpxchg128_relaxed(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg128_relaxed((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg128_relaxed */
+
+#else /* arch_try_cmpxchg128_relaxed */
+
+#ifndef arch_try_cmpxchg128_acquire
+#define arch_try_cmpxchg128_acquire(...) \
+	__atomic_op_acquire(arch_try_cmpxchg128, __VA_ARGS__)
+#endif
+
+#ifndef arch_try_cmpxchg128_release
+#define arch_try_cmpxchg128_release(...) \
+	__atomic_op_release(arch_try_cmpxchg128, __VA_ARGS__)
+#endif
+
+#ifndef arch_try_cmpxchg128
+#define arch_try_cmpxchg128(...) \
+	__atomic_op_fence(arch_try_cmpxchg128, __VA_ARGS__)
+#endif
+
+#endif /* arch_try_cmpxchg128_relaxed */
+
 #ifndef arch_atomic_read_acquire
 static __always_inline int
 arch_atomic_read_acquire(const atomic_t *v)
@@ -2456,4 +2549,4 @@ arch_atomic64_dec_if_positive(atomic64_t
 #endif
 
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// b5e87bdd5ede61470c29f7a7e4de781af3770f09
+// 46357a526de89c762d30fb238f35a7d5950a670b
--- a/include/linux/atomic/atomic-instrumented.h
+++ b/include/linux/atomic/atomic-instrumented.h
@@ -1968,6 +1968,36 @@ atomic_long_dec_if_positive(atomic_long_
 	arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
 })
 
+#define cmpxchg128(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	kcsan_mb(); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg128(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_acquire(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg128_acquire(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_release(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	kcsan_release(); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg128_release(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_relaxed(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg128_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+
 #define try_cmpxchg(ptr, oldp, ...) \
 ({ \
 	typeof(ptr) __ai_ptr = (ptr); \
@@ -2044,6 +2074,44 @@ atomic_long_dec_if_positive(atomic_long_
 	arch_try_cmpxchg64_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
+#define try_cmpxchg128(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	kcsan_mb(); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg128(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_acquire(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg128_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_release(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	kcsan_release(); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg128_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_relaxed(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg128_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
 #define cmpxchg_local(ptr, ...) \
 ({ \
 	typeof(ptr) __ai_ptr = (ptr); \
@@ -2058,6 +2126,13 @@ atomic_long_dec_if_positive(atomic_long_
 	arch_cmpxchg64_local(__ai_ptr, __VA_ARGS__); \
 })
 
+#define cmpxchg128_local(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
+})
+
 #define sync_cmpxchg(ptr, ...) \
 ({ \
 	typeof(ptr) __ai_ptr = (ptr); \
@@ -2083,4 +2158,4 @@ atomic_long_dec_if_positive(atomic_long_
 })
 
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
-// 764f741eb77a7ad565dc8d99ce2837d5542e8aee
+// 27320c1ec2bf2878ecb9df3ea4816a7bc0c57a52
--- a/scripts/atomic/gen-atomic-fallback.sh
+++ b/scripts/atomic/gen-atomic-fallback.sh
@@ -217,11 +217,11 @@ cat << EOF
 
 EOF
 
-for xchg in "arch_xchg" "arch_cmpxchg" "arch_cmpxchg64"; do
+for xchg in "arch_xchg" "arch_cmpxchg" "arch_cmpxchg64" "arch_cmpxchg128"; do
 	gen_xchg_fallbacks "${xchg}"
 done
 
-for cmpxchg in "cmpxchg" "cmpxchg64"; do
+for cmpxchg in "cmpxchg" "cmpxchg64" "cmpxchg128"; do
 	gen_try_cmpxchg_fallbacks "${cmpxchg}"
 done
 
--- a/scripts/atomic/gen-atomic-instrumented.sh
+++ b/scripts/atomic/gen-atomic-instrumented.sh
@@ -166,14 +166,14 @@ grep '^[a-z]' "$1" | while read name met
 done
 
 
-for xchg in "xchg" "cmpxchg" "cmpxchg64" "try_cmpxchg" "try_cmpxchg64"; do
+for xchg in "xchg" "cmpxchg" "cmpxchg64" "cmpxchg128" "try_cmpxchg" "try_cmpxchg64" "try_cmpxchg128"; do
 	for order in "" "_acquire" "_release" "_relaxed"; do
 		gen_xchg "${xchg}" "${order}" ""
 		printf "\n"
 	done
 done
 
-for xchg in "cmpxchg_local" "cmpxchg64_local" "sync_cmpxchg"; do
+for xchg in "cmpxchg_local" "cmpxchg64_local" "cmpxchg128_local" "sync_cmpxchg"; do
 	gen_xchg "${xchg}" "" ""
 	printf "\n"
 done




* [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 04/10] instrumentation: Wire up cmpxchg128() Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-02 17:05   ` Heiko Carstens
                     ` (2 more replies)
  2023-02-02 14:50 ` [PATCH v2 06/10] x86,amd_iommu: Replace cmpxchg_double() Peter Zijlstra
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

In order to replace cmpxchg_double() with the newly minted
cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
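Not part of the patch -- a sketch modelled on the SLUB conversion later
in the series, showing a 16-byte per-CPU pair going through plain
this_cpu_cmpxchg(); all names here are hypothetical:

  #include <linux/percpu.h>

  typedef union {
          struct {
                  void *ptr;
                  unsigned long seq;
          };
          u128 full;      /* forces natural (16-byte) alignment */
  } pcp_pair_t;

  static DEFINE_PER_CPU(pcp_pair_t, pcp_pair);

  static bool pcp_pair_update(void *old_ptr, unsigned long old_seq,
                              void *new_ptr, unsigned long new_seq)
  {
          pcp_pair_t o = { .ptr = old_ptr, .seq = old_seq };
          pcp_pair_t n = { .ptr = new_ptr, .seq = new_seq };

          /* sizeof(pcp_pair.full) == 16, so this resolves to this_cpu_cmpxchg_16() */
          return this_cpu_cmpxchg(pcp_pair.full, o.full, n.full) == o.full;
  }
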
 arch/arm64/include/asm/percpu.h |   21 +++++++++++++++
 arch/s390/include/asm/percpu.h  |   17 ++++++++++++
 arch/x86/include/asm/percpu.h   |   56 ++++++++++++++++++++++++++++++++++++++++
 include/asm-generic/percpu.h    |    8 +++++
 include/linux/percpu-defs.h     |   20 ++++++++++++--
 5 files changed, 120 insertions(+), 2 deletions(-)

--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -140,6 +140,10 @@ PERCPU_RET_OP(add, add, ldadd)
  * re-enabling preemption for preemptible kernels, but doing that in a way
  * which builds inside a module would mean messing directly with the preempt
  * count. If you do this, peterz and tglx will hunt you down.
+ *
+ * Not to mention it'll break the actual preemption model for missing a
+ * preemption point when TIF_NEED_RESCHED gets set while preemption is
+ * disabled.
  */
 #define this_cpu_cmpxchg_double_8(ptr1, ptr2, o1, o2, n1, n2)		\
 ({									\
@@ -240,6 +244,23 @@ PERCPU_RET_OP(add, add, ldadd)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
 
+#define this_cpu_cmpxchg_16(pcp, o, n)					\
+({									\
+	typedef typeof(pcp) pcp_op_T__;					\
+	union {								\
+		pcp_op_T__ pot;						\
+		u128 val;						\
+	} old__, new__, ret__;						\
+	pcp_op_T__ *ptr__;						\
+	old__.pot = o;							\
+	new__.pot = n;							\
+	preempt_disable_notrace();					\
+	ptr__ = raw_cpu_ptr(&(pcp));					\
+	ret__.val = cmpxchg128_local((void *)ptr__, old__.val, new__.val); \
+	preempt_enable_notrace();					\
+	ret__.pot;							\
+})
+
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
 #define __per_cpu_offset
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -148,6 +148,23 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_16(pcp, oval, nval)				\
+({									\
+	typedef typeof(pcp) pcp_op_T__;					\
+	union {								\
+		pcp_op_T__ pot;						\
+		u128 val;						\
+	} old__, new__, ret__;						\
+	pcp_op_T__ *ptr__;						\
+	old__.pot = oval;						\
+	new__.pot = nval;						\
+	preempt_disable_notrace();					\
+	ptr__ = raw_cpu_ptr(&(pcp));					\
+	ret__.val = cmpxchg128((void *)ptr__, old__.val, new__.val);	\
+	preempt_enable_notrace();					\
+	ret__.pot;							\
+})
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -210,6 +210,62 @@ do {									\
 	(typeof(_var))(unsigned long) pco_old__;			\
 })
 
+#if defined(CONFIG_X86_32) && defined(CONFIG_X86_CMPXCHG64)
+#define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval)		\
+({									\
+	union {								\
+		typeof(_var) var;					\
+		struct {						\
+			u32 low, high;					\
+		};							\
+	} old__, new__;							\
+									\
+	old__.var = _oval;						\
+	new__.var = _nval;						\
+									\
+	asm qual ("cmpxchg8b " __percpu_arg([var])			\
+		  : [var] "+m" (_var),					\
+		    "+a" (old__.low),					\
+		    "+d" (old__.high)					\
+		  : "b" (new__.low),					\
+		    "c" (new__.high)					\
+		  : "memory");						\
+									\
+	old__.var;							\
+})
+
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8,         , pcp, oval, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
+#endif
+
+#ifdef CONFIG_X86_64
+#define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval)		\
+({									\
+	union {								\
+		typeof(_var) var;					\
+		struct {						\
+			u64 low, high;					\
+		};							\
+	} old__, new__;							\
+									\
+	old__.var = _oval;						\
+	new__.var = _nval;						\
+									\
+	asm qual ("cmpxchg16b " __percpu_arg([var])			\
+		  : [var] "+m" (_var),					\
+		    "+a" (old__.low),					\
+		    "+d" (old__.high)					\
+		  : "b" (new__.low),					\
+		    "c" (new__.high)					\
+		  : "memory");						\
+									\
+	old__.var;							\
+})
+
+#define raw_cpu_cmpxchg_16(pcp, oval, nval)	percpu_cmpxchg128_op(16,         , pcp, oval, nval)
+#define this_cpu_cmpxchg_16(pcp, oval, nval)	percpu_cmpxchg128_op(16, volatile, pcp, oval, nval)
+#endif
+
 /*
  * this_cpu_read() makes gcc load the percpu variable every time it is
  * accessed while this_cpu_read_stable() allows the value to be cached.
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -298,6 +298,10 @@ do {									\
 #define raw_cpu_cmpxchg_8(pcp, oval, nval) \
 	raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
+#ifndef raw_cpu_cmpxchg_16
+#define raw_cpu_cmpxchg_16(pcp, oval, nval) \
+	raw_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
 
 #ifndef raw_cpu_cmpxchg_double_1
 #define raw_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
@@ -423,6 +427,10 @@ do {									\
 #define this_cpu_cmpxchg_8(pcp, oval, nval) \
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
+#ifndef this_cpu_cmpxchg_16
+#define this_cpu_cmpxchg_16(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
 
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -343,6 +343,22 @@ static inline void __this_cpu_preempt_ch
 	pscr2_ret__;							\
 })
 
+#define __pcpu_size16_call_return2(stem, variable, ...)			\
+({									\
+	typeof(variable) pscr2_ret__;					\
+	__verify_pcpu_ptr(&(variable));					\
+	switch(sizeof(variable)) {					\
+	case 1: pscr2_ret__ = stem##1(variable, __VA_ARGS__); break;	\
+	case 2: pscr2_ret__ = stem##2(variable, __VA_ARGS__); break;	\
+	case 4: pscr2_ret__ = stem##4(variable, __VA_ARGS__); break;	\
+	case 8: pscr2_ret__ = stem##8(variable, __VA_ARGS__); break;	\
+	case 16: pscr2_ret__ = stem##16(variable, __VA_ARGS__); break;	\
+	default:							\
+		__bad_size_call_parameter(); break;			\
+	}								\
+	pscr2_ret__;							\
+})
+
 /*
  * Special handling for cmpxchg_double.  cmpxchg_double is passed two
  * percpu variables.  The first has to be aligned to a double word
@@ -425,7 +441,7 @@ do {									\
 #define raw_cpu_add_return(pcp, val)	__pcpu_size_call_return2(raw_cpu_add_return_, pcp, val)
 #define raw_cpu_xchg(pcp, nval)		__pcpu_size_call_return2(raw_cpu_xchg_, pcp, nval)
 #define raw_cpu_cmpxchg(pcp, oval, nval) \
-	__pcpu_size_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
+	__pcpu_size16_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
 #define raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 
@@ -512,7 +528,7 @@ do {									\
 #define this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
-	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
 #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 




* [PATCH v2 06/10] x86,amd_iommu: Replace cmpxchg_double()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-02 14:50 ` [PATCH v2 07/10] x86,intel_iommu: " Peter Zijlstra
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto, Vasant Hegde

Convert modify_irte_ga() from cmpxchg_double() to try_cmpxchg128(),
using a new u128 view of struct irte_ga.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vasant Hegde <vasant.hegde@amd.com>
---
 drivers/iommu/amd/amd_iommu_types.h |    9 +++++++--
 drivers/iommu/amd/iommu.c           |   10 ++++------
 2 files changed, 11 insertions(+), 8 deletions(-)

--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -979,8 +979,13 @@ union irte_ga_hi {
 };
 
 struct irte_ga {
-	union irte_ga_lo lo;
-	union irte_ga_hi hi;
+	union {
+		struct {
+			union irte_ga_lo lo;
+			union irte_ga_hi hi;
+		};
+		u128 irte;
+	};
 };
 
 struct irq_2_irte {
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2992,10 +2992,10 @@ static int alloc_irq_index(struct amd_io
 static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
 			  struct irte_ga *irte, struct amd_ir_data *data)
 {
-	bool ret;
 	struct irq_remap_table *table;
-	unsigned long flags;
 	struct irte_ga *entry;
+	unsigned long flags;
+	u128 old;
 
 	table = get_irq_table(iommu, devid);
 	if (!table)
@@ -3006,16 +3006,14 @@ static int modify_irte_ga(struct amd_iom
 	entry = (struct irte_ga *)table->table;
 	entry = &entry[index];
 
-	ret = cmpxchg_double(&entry->lo.val, &entry->hi.val,
-			     entry->lo.val, entry->hi.val,
-			     irte->lo.val, irte->hi.val);
 	/*
 	 * We use cmpxchg16 to atomically update the 128-bit IRTE,
 	 * and it cannot be updated by the hardware or other processors
 	 * behind us, so the return value of cmpxchg16 should be the
 	 * same as the old value.
 	 */
-	WARN_ON(!ret);
+	old = entry->irte;
+	WARN_ON(!try_cmpxchg128(&entry->irte, &old, irte->irte));
 
 	if (data)
 		data->ref = entry;




* [PATCH v2 07/10] x86,intel_iommu: Replace cmpxchg_double()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 06/10] x86,amd_iommu: Replace cmpxchg_double() Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-03  2:51   ` Baolu Lu
  2023-02-02 14:50 ` [PATCH v2 08/10] slub: " Peter Zijlstra
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Convert modify_irte() from cmpxchg_double() to try_cmpxchg128(),
adding a __u128 'irte' member to struct irte for the full 128-bit view.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/iommu/intel/irq_remapping.c |    8 --
 include/linux/dmar.h                |  125 +++++++++++++++++++-----------------
 2 files changed, 68 insertions(+), 65 deletions(-)

--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -174,18 +174,14 @@ static int modify_irte(struct irq_2_iomm
 	irte = &iommu->ir_table->base[index];
 
 	if ((irte->pst == 1) || (irte_modified->pst == 1)) {
-		bool ret;
-
-		ret = cmpxchg_double(&irte->low, &irte->high,
-				     irte->low, irte->high,
-				     irte_modified->low, irte_modified->high);
 		/*
 		 * We use cmpxchg16 to atomically update the 128-bit IRTE,
 		 * and it cannot be updated by the hardware or other processors
 		 * behind us, so the return value of cmpxchg16 should be the
 		 * same as the old value.
 		 */
-		WARN_ON(!ret);
+		u128 old = irte->irte;
+		WARN_ON(!try_cmpxchg128(&irte->irte, &old, irte_modified->irte));
 	} else {
 		WRITE_ONCE(irte->low, irte_modified->low);
 		WRITE_ONCE(irte->high, irte_modified->high);
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -201,67 +201,74 @@ static inline void detect_intel_iommu(vo
 
 struct irte {
 	union {
-		/* Shared between remapped and posted mode*/
 		struct {
-			__u64	present		: 1,  /*  0      */
-				fpd		: 1,  /*  1      */
-				__res0		: 6,  /*  2 -  6 */
-				avail		: 4,  /*  8 - 11 */
-				__res1		: 3,  /* 12 - 14 */
-				pst		: 1,  /* 15      */
-				vector		: 8,  /* 16 - 23 */
-				__res2		: 40; /* 24 - 63 */
+			union {
+				/* Shared between remapped and posted mode*/
+				struct {
+					__u64	present		: 1,  /*  0      */
+						fpd		: 1,  /*  1      */
+						__res0		: 6,  /*  2 -  6 */
+						avail		: 4,  /*  8 - 11 */
+						__res1		: 3,  /* 12 - 14 */
+						pst		: 1,  /* 15      */
+						vector		: 8,  /* 16 - 23 */
+						__res2		: 40; /* 24 - 63 */
+				};
+
+				/* Remapped mode */
+				struct {
+					__u64	r_present	: 1,  /*  0      */
+						r_fpd		: 1,  /*  1      */
+						dst_mode	: 1,  /*  2      */
+						redir_hint	: 1,  /*  3      */
+						trigger_mode	: 1,  /*  4      */
+						dlvry_mode	: 3,  /*  5 -  7 */
+						r_avail		: 4,  /*  8 - 11 */
+						r_res0		: 4,  /* 12 - 15 */
+						r_vector	: 8,  /* 16 - 23 */
+						r_res1		: 8,  /* 24 - 31 */
+						dest_id		: 32; /* 32 - 63 */
+				};
+
+				/* Posted mode */
+				struct {
+					__u64	p_present	: 1,  /*  0      */
+						p_fpd		: 1,  /*  1      */
+						p_res0		: 6,  /*  2 -  7 */
+						p_avail		: 4,  /*  8 - 11 */
+						p_res1		: 2,  /* 12 - 13 */
+						p_urgent	: 1,  /* 14      */
+						p_pst		: 1,  /* 15      */
+						p_vector	: 8,  /* 16 - 23 */
+						p_res2		: 14, /* 24 - 37 */
+						pda_l		: 26; /* 38 - 63 */
+				};
+				__u64 low;
+			};
+
+			union {
+				/* Shared between remapped and posted mode*/
+				struct {
+					__u64	sid		: 16,  /* 64 - 79  */
+						sq		: 2,   /* 80 - 81  */
+						svt		: 2,   /* 82 - 83  */
+						__res3		: 44;  /* 84 - 127 */
+				};
+
+				/* Posted mode*/
+				struct {
+					__u64	p_sid		: 16,  /* 64 - 79  */
+						p_sq		: 2,   /* 80 - 81  */
+						p_svt		: 2,   /* 82 - 83  */
+						p_res3		: 12,  /* 84 - 95  */
+						pda_h		: 32;  /* 96 - 127 */
+				};
+				__u64 high;
+			};
 		};
-
-		/* Remapped mode */
-		struct {
-			__u64	r_present	: 1,  /*  0      */
-				r_fpd		: 1,  /*  1      */
-				dst_mode	: 1,  /*  2      */
-				redir_hint	: 1,  /*  3      */
-				trigger_mode	: 1,  /*  4      */
-				dlvry_mode	: 3,  /*  5 -  7 */
-				r_avail		: 4,  /*  8 - 11 */
-				r_res0		: 4,  /* 12 - 15 */
-				r_vector	: 8,  /* 16 - 23 */
-				r_res1		: 8,  /* 24 - 31 */
-				dest_id		: 32; /* 32 - 63 */
-		};
-
-		/* Posted mode */
-		struct {
-			__u64	p_present	: 1,  /*  0      */
-				p_fpd		: 1,  /*  1      */
-				p_res0		: 6,  /*  2 -  7 */
-				p_avail		: 4,  /*  8 - 11 */
-				p_res1		: 2,  /* 12 - 13 */
-				p_urgent	: 1,  /* 14      */
-				p_pst		: 1,  /* 15      */
-				p_vector	: 8,  /* 16 - 23 */
-				p_res2		: 14, /* 24 - 37 */
-				pda_l		: 26; /* 38 - 63 */
-		};
-		__u64 low;
-	};
-
-	union {
-		/* Shared between remapped and posted mode*/
-		struct {
-			__u64	sid		: 16,  /* 64 - 79  */
-				sq		: 2,   /* 80 - 81  */
-				svt		: 2,   /* 82 - 83  */
-				__res3		: 44;  /* 84 - 127 */
-		};
-
-		/* Posted mode*/
-		struct {
-			__u64	p_sid		: 16,  /* 64 - 79  */
-				p_sq		: 2,   /* 80 - 81  */
-				p_svt		: 2,   /* 82 - 83  */
-				p_res3		: 12,  /* 84 - 95  */
-				pda_h		: 32;  /* 96 - 127 */
-		};
-		__u64 high;
+#ifdef CONFIG_IRQ_REMAP
+		__u128 irte;
+#endif
 	};
 };
 




* [PATCH v2 08/10] slub: Replace cmpxchg_double()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 07/10] x86,intel_iommu: " Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-08 13:31   ` Hyeonggon Yoo
  2023-02-02 14:50 ` [PATCH v2 09/10] arch: Remove cmpxchg_double Peter Zijlstra
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Convert the SLUB freelist+counters and freelist+tid pairs from
cmpxchg_double()/this_cpu_cmpxchg_double() to helpers built around a
freelist_aba_t union updated with try_cmpxchg128()/try_cmpxchg64() and
this_cpu_cmpxchg().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
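Not part of the patch -- a generic illustration of the pointer+counter
trick that freelist_aba_t encodes (a hypothetical lock-free pop, not
the SLUB code; assumes a 64-bit kernel where system_has_cmpxchg128()):

  #include <linux/atomic.h>

  typedef union {
          struct {
                  void *ptr;
                  unsigned long counter;
          };
          u128 full;
  } aba_t;

  static bool aba_pop(aba_t *head, void *(*next_of)(void *), void **out)
  {
          aba_t old = *head, new;         /* racy snapshot is fine here */

          do {
                  if (!old.ptr)
                          return false;
                  new.ptr = next_of(old.ptr);
                  /* the counter makes an A->B->A change of ->ptr visible,
                   * which a cmpxchg on the bare pointer would miss */
                  new.counter = old.counter + 1;
          } while (!try_cmpxchg128(&head->full, &old.full, new.full));

          *out = old.ptr;
          return true;
  }
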
 include/linux/slub_def.h |   12 ++-
 mm/slab.h                |   45 +++++++++++++-
 mm/slub.c                |  142 ++++++++++++++++++++++++++++-------------------
 3 files changed, 135 insertions(+), 64 deletions(-)

--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -39,7 +39,8 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
-	NR_SLUB_STAT_ITEMS };
+	NR_SLUB_STAT_ITEMS
+};
 
 #ifndef CONFIG_SLUB_TINY
 /*
@@ -47,8 +48,13 @@ enum stat_item {
  * with this_cpu_cmpxchg_double() alignment requirements.
  */
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to next available object */
-	unsigned long tid;	/* Globally unique transaction id */
+	union {
+		struct {
+			void **freelist;	/* Pointer to next available object */
+			unsigned long tid;	/* Globally unique transaction id */
+		};
+		freelist_aba_t freelist_tid;
+	};
 	struct slab *slab;	/* The slab from which we are allocating */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct slab *partial;	/* Partially allocated frozen slabs */
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -5,6 +5,34 @@
  * Internal slab definitions
  */
 
+/*
+ * Freelist pointer and counter to cmpxchg together, avoids the typical ABA
+ * problems with cmpxchg of just a pointer.
+ */
+typedef union {
+	struct {
+		void *freelist;
+		unsigned long counter;
+	};
+#ifdef CONFIG_64BIT
+	u128 full;
+#else
+	u64 full;
+#endif
+} freelist_aba_t;
+
+#ifdef CONFIG_64BIT
+# ifdef system_has_cmpxchg128
+# define system_has_freelist_aba()	system_has_cmpxchg128()
+# define try_cmpxchg_freelist		try_cmpxchg128
+# endif
+#else /* CONFIG_64BIT */
+# ifdef system_has_cmpxchg64
+# define system_has_freelist_aba()	system_has_cmpxchg64()
+# define try_cmpxchg_freelist		try_cmpxchg64
+# endif
+#endif /* CONFIG_64BIT */
+
 /* Reuses the bits in struct page */
 struct slab {
 	unsigned long __page_flags;
@@ -37,14 +65,21 @@ struct slab {
 #endif
 			};
 			/* Double-word boundary */
-			void *freelist;		/* first free object */
 			union {
-				unsigned long counters;
 				struct {
-					unsigned inuse:16;
-					unsigned objects:15;
-					unsigned frozen:1;
+					void *freelist;		/* first free object */
+					union {
+						unsigned long counters;
+						struct {
+							unsigned inuse:16;
+							unsigned objects:15;
+							unsigned frozen:1;
+						};
+					};
 				};
+#ifdef system_has_freelist_aba
+				freelist_aba_t freelist_counter;
+#endif
 			};
 		};
 		struct rcu_head rcu_head;
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -292,7 +292,13 @@ static inline bool kmem_cache_has_cpu_pa
 /* Poison object */
 #define __OBJECT_POISON		((slab_flags_t __force)0x80000000U)
 /* Use cmpxchg_double */
+
+#if defined(system_has_freelist_aba) && \
+    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
 #define __CMPXCHG_DOUBLE	((slab_flags_t __force)0x40000000U)
+#else
+#define __CMPXCHG_DOUBLE	((slab_flags_t __force)0U)
+#endif
 
 /*
  * Tracking user of a slab.
@@ -512,6 +518,43 @@ static __always_inline void slab_unlock(
 	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
+static inline bool
+__update_freelist_fast(struct slab *slab,
+		      void *freelist_old, unsigned long counters_old,
+		      void *freelist_new, unsigned long counters_new)
+{
+
+	bool ret = false;
+
+#ifdef system_has_freelist_aba
+	freelist_aba_t old = { .freelist = freelist_old, .counter = counters_old };
+	freelist_aba_t new = { .freelist = freelist_new, .counter = counters_new };
+
+	ret = try_cmpxchg_freelist(&slab->freelist_counter.full, &old.full, new.full);
+#endif /* system_has_freelist_aba */
+
+	return ret;
+}
+
+static inline bool
+__update_freelist_slow(struct slab *slab,
+		      void *freelist_old, unsigned long counters_old,
+		      void *freelist_new, unsigned long counters_new)
+{
+	bool ret = false;
+
+	slab_lock(slab);
+	if (slab->freelist == freelist_old &&
+	    slab->counters == counters_old) {
+		slab->freelist = freelist_new;
+		slab->counters = counters_new;
+		ret = true;
+	}
+	slab_unlock(slab);
+
+	return ret;
+}
+
 /*
  * Interrupts must be disabled (for the fallback code to work right), typically
  * by an _irqsave() lock variant. On PREEMPT_RT the preempt_disable(), which is
@@ -519,33 +562,25 @@ static __always_inline void slab_unlock(
  * allocation/ free operation in hardirq context. Therefore nothing can
  * interrupt the operation.
  */
-static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
+static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *slab,
 		void *freelist_old, unsigned long counters_old,
 		void *freelist_new, unsigned long counters_new,
 		const char *n)
 {
+	bool ret;
+
 	if (USE_LOCKLESS_FAST_PATH())
 		lockdep_assert_irqs_disabled();
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
-    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+
 	if (s->flags & __CMPXCHG_DOUBLE) {
-		if (cmpxchg_double(&slab->freelist, &slab->counters,
-				   freelist_old, counters_old,
-				   freelist_new, counters_new))
-			return true;
-	} else
-#endif
-	{
-		slab_lock(slab);
-		if (slab->freelist == freelist_old &&
-					slab->counters == counters_old) {
-			slab->freelist = freelist_new;
-			slab->counters = counters_new;
-			slab_unlock(slab);
-			return true;
-		}
-		slab_unlock(slab);
+		ret = __update_freelist_fast(slab, freelist_old, counters_old,
+				            freelist_new, counters_new);
+	} else {
+		ret = __update_freelist_slow(slab, freelist_old, counters_old,
+				            freelist_new, counters_new);
 	}
+	if (likely(ret))
+		return true;
 
 	cpu_relax();
 	stat(s, CMPXCHG_DOUBLE_FAIL);
@@ -557,36 +592,26 @@ static inline bool __cmpxchg_double_slab
 	return false;
 }
 
-static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
+static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
 		void *freelist_old, unsigned long counters_old,
 		void *freelist_new, unsigned long counters_new,
 		const char *n)
 {
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
-    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+	bool ret;
+
 	if (s->flags & __CMPXCHG_DOUBLE) {
-		if (cmpxchg_double(&slab->freelist, &slab->counters,
-				   freelist_old, counters_old,
-				   freelist_new, counters_new))
-			return true;
-	} else
-#endif
-	{
+		ret = __update_freelist_fast(slab, freelist_old, counters_old,
+				            freelist_new, counters_new);
+	} else {
 		unsigned long flags;
 
 		local_irq_save(flags);
-		slab_lock(slab);
-		if (slab->freelist == freelist_old &&
-					slab->counters == counters_old) {
-			slab->freelist = freelist_new;
-			slab->counters = counters_new;
-			slab_unlock(slab);
-			local_irq_restore(flags);
-			return true;
-		}
-		slab_unlock(slab);
+		ret = __update_freelist_slow(slab, freelist_old, counters_old,
+				            freelist_new, counters_new);
 		local_irq_restore(flags);
 	}
+	if (likely(ret))
+		return true;
 
 	cpu_relax();
 	stat(s, CMPXCHG_DOUBLE_FAIL);
@@ -2229,7 +2254,7 @@ static inline void *acquire_slab(struct
 	VM_BUG_ON(new.frozen);
 	new.frozen = 1;
 
-	if (!__cmpxchg_double_slab(s, slab,
+	if (!__slab_update_freelist(s, slab,
 			freelist, counters,
 			new.freelist, new.counters,
 			"acquire_slab"))
@@ -2555,7 +2580,7 @@ static void deactivate_slab(struct kmem_
 	}
 
 
-	if (!cmpxchg_double_slab(s, slab,
+	if (!slab_update_freelist(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab")) {
@@ -2612,7 +2637,7 @@ static void __unfreeze_partials(struct k
 
 			new.frozen = 0;
 
-		} while (!__cmpxchg_double_slab(s, slab,
+		} while (!__slab_update_freelist(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"));
@@ -3009,6 +3034,18 @@ static inline bool pfmemalloc_match(stru
 }
 
 #ifndef CONFIG_SLUB_TINY
+static inline bool
+__update_cpu_freelist_fast(struct kmem_cache *s,
+			   void *freelist_old, void *freelist_new,
+			   unsigned long tid)
+{
+	freelist_aba_t old = { .freelist = freelist_old, .counter = tid };
+	freelist_aba_t new = { .freelist = freelist_new, .counter = next_tid(tid) };
+
+	return this_cpu_cmpxchg(s->cpu_slab->freelist_tid.full,
+				old.full, new.full) == old.full;
+}
+
 /*
  * Check the slab->freelist and either transfer the freelist to the
  * per cpu freelist or deactivate the slab.
@@ -3035,7 +3072,7 @@ static inline void *get_freelist(struct
 		new.inuse = slab->objects;
 		new.frozen = freelist != NULL;
 
-	} while (!__cmpxchg_double_slab(s, slab,
+	} while (!__slab_update_freelist(s, slab,
 		freelist, counters,
 		NULL, new.counters,
 		"get_freelist"));
@@ -3360,11 +3397,7 @@ static __always_inline void *__slab_allo
 		 * against code executing on this cpu *not* from access by
 		 * other cpus.
 		 */
-		if (unlikely(!this_cpu_cmpxchg_double(
-				s->cpu_slab->freelist, s->cpu_slab->tid,
-				object, tid,
-				next_object, next_tid(tid)))) {
-
+		if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
 		}
@@ -3632,7 +3665,7 @@ static void __slab_free(struct kmem_cach
 			}
 		}
 
-	} while (!cmpxchg_double_slab(s, slab,
+	} while (!slab_update_freelist(s, slab,
 		prior, counters,
 		head, new.counters,
 		"__slab_free"));
@@ -3737,11 +3770,7 @@ static __always_inline void do_slab_free
 
 		set_freepointer(s, tail_obj, freelist);
 
-		if (unlikely(!this_cpu_cmpxchg_double(
-				s->cpu_slab->freelist, s->cpu_slab->tid,
-				freelist, tid,
-				head, next_tid(tid)))) {
-
+		if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
@@ -4505,11 +4534,12 @@ static int kmem_cache_open(struct kmem_c
 		}
 	}
 
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
+#if defined(system_has_freelist_aba) && \
     defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
-	if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0)
+	if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
 		/* Enable fast mode */
 		s->flags |= __CMPXCHG_DOUBLE;
+	}
 #endif
 
 	/*




* [PATCH v2 09/10] arch: Remove cmpxchg_double
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 08/10] slub: " Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-08 13:44   ` Hyeonggon Yoo
  2023-02-02 14:50 ` [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128() Peter Zijlstra
  2023-02-02 19:39 ` [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Linus Torvalds
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

No moar users, remove the monster.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/core-api/this_cpu_ops.rst    |    2 -
 arch/arm64/include/asm/atomic_ll_sc.h      |   33 ----------------
 arch/arm64/include/asm/atomic_lse.h        |   36 ------------------
 arch/arm64/include/asm/cmpxchg.h           |   46 -----------------------
 arch/arm64/include/asm/percpu.h            |   10 -----
 arch/s390/include/asm/cmpxchg.h            |   34 -----------------
 arch/s390/include/asm/percpu.h             |   18 ---------
 arch/x86/include/asm/cmpxchg.h             |   25 ------------
 arch/x86/include/asm/cmpxchg_32.h          |    1 
 arch/x86/include/asm/cmpxchg_64.h          |    1 
 arch/x86/include/asm/percpu.h              |   41 --------------------
 include/asm-generic/percpu.h               |   58 -----------------------------
 include/linux/atomic/atomic-instrumented.h |   17 --------
 include/linux/percpu-defs.h                |   38 -------------------
 scripts/atomic/gen-atomic-instrumented.sh  |   17 ++------
 15 files changed, 6 insertions(+), 371 deletions(-)
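
The removal below leans on the conversions done in the earlier patches; as a reminder, the replacement idiom is a single compare-and-exchange on a naturally aligned u128. A minimal sketch, with illustrative names rather than the series' actual helpers:

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Minimal sketch of the replacement idiom, not a helper from this series:
 * the two adjacent machine words that cmpxchg_double() operated on become
 * one naturally aligned u128, updated with try_cmpxchg128().
 */
union word_pair {
	struct {
		u64 lo, hi;
	};
	u128 full;	/* u128 is naturally aligned, so the union is too */
};

static inline bool pair_update(union word_pair *p, union word_pair old,
			       union word_pair new)
{
	/* Was: cmpxchg_double(&p->lo, &p->hi, old.lo, old.hi, new.lo, new.hi) */
	return try_cmpxchg128(&p->full, &old.full, new.full);
}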

--- a/Documentation/core-api/this_cpu_ops.rst
+++ b/Documentation/core-api/this_cpu_ops.rst
@@ -53,7 +53,6 @@ are defined. These operations can be use
 	this_cpu_add_return(pcp, val)
 	this_cpu_xchg(pcp, nval)
 	this_cpu_cmpxchg(pcp, oval, nval)
-	this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
 	this_cpu_sub(pcp, val)
 	this_cpu_inc(pcp)
 	this_cpu_dec(pcp)
@@ -242,7 +241,6 @@ modifies the variable, then RMW actions
 	__this_cpu_add_return(pcp, val)
 	__this_cpu_xchg(pcp, nval)
 	__this_cpu_cmpxchg(pcp, oval, nval)
-	__this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
 	__this_cpu_sub(pcp, val)
 	__this_cpu_inc(pcp)
 	__this_cpu_dec(pcp)
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -294,39 +294,6 @@ __CMPXCHG_CASE( ,  ,  mb_, 64, dmb ish,
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name, mb, rel, cl)				\
-static __always_inline long						\
-__ll_sc__cmpxchg_double##name(unsigned long old1,			\
-				      unsigned long old2,		\
-				      unsigned long new1,		\
-				      unsigned long new2,		\
-				      volatile void *ptr)		\
-{									\
-	unsigned long tmp, ret;						\
-									\
-	asm volatile("// __cmpxchg_double" #name "\n"			\
-	"	prfm	pstl1strm, %2\n"				\
-	"1:	ldxp	%0, %1, %2\n"					\
-	"	eor	%0, %0, %3\n"					\
-	"	eor	%1, %1, %4\n"					\
-	"	orr	%1, %0, %1\n"					\
-	"	cbnz	%1, 2f\n"					\
-	"	st" #rel "xp	%w0, %5, %6, %2\n"			\
-	"	cbnz	%w0, 1b\n"					\
-	"	" #mb "\n"						\
-	"2:"								\
-	: "=&r" (tmp), "=&r" (ret), "+Q" (*(__uint128_t *)ptr)		\
-	: "r" (old1), "r" (old2), "r" (new1), "r" (new2)		\
-	: cl);								\
-									\
-	return ret;							\
-}
-
-__CMPXCHG_DBL(   ,        ,  ,         )
-__CMPXCHG_DBL(_mb, dmb ish, l, "memory")
-
-#undef __CMPXCHG_DBL
-
 union __u128_halves {
 	u128 full;
 	struct {
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -288,42 +288,6 @@ __CMPXCHG_CASE(x,  ,  mb_, 64, al, "memo
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name, mb, cl...)					\
-static __always_inline long						\
-__lse__cmpxchg_double##name(unsigned long old1,				\
-					 unsigned long old2,		\
-					 unsigned long new1,		\
-					 unsigned long new2,		\
-					 volatile void *ptr)		\
-{									\
-	unsigned long oldval1 = old1;					\
-	unsigned long oldval2 = old2;					\
-	register unsigned long x0 asm ("x0") = old1;			\
-	register unsigned long x1 asm ("x1") = old2;			\
-	register unsigned long x2 asm ("x2") = new1;			\
-	register unsigned long x3 asm ("x3") = new2;			\
-	register unsigned long x4 asm ("x4") = (unsigned long)ptr;	\
-									\
-	asm volatile(							\
-	__LSE_PREAMBLE							\
-	"	casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
-	"	eor	%[old1], %[old1], %[oldval1]\n"			\
-	"	eor	%[old2], %[old2], %[oldval2]\n"			\
-	"	orr	%[old1], %[old1], %[old2]"			\
-	: [old1] "+&r" (x0), [old2] "+&r" (x1),				\
-	  [v] "+Q" (*(__uint128_t *)ptr)				\
-	: [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),		\
-	  [oldval1] "r" (oldval1), [oldval2] "r" (oldval2)		\
-	: cl);								\
-									\
-	return x0;							\
-}
-
-__CMPXCHG_DBL(   ,   )
-__CMPXCHG_DBL(_mb, al, "memory")
-
-#undef __CMPXCHG_DBL
-
 #define __CMPXCHG128(name, mb, cl...)					\
 static __always_inline u128						\
 __lse__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)		\
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -131,22 +131,6 @@ __CMPXCHG_CASE(mb_, 64)
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name)						\
-static inline long __cmpxchg_double##name(unsigned long old1,		\
-					 unsigned long old2,		\
-					 unsigned long new1,		\
-					 unsigned long new2,		\
-					 volatile void *ptr)		\
-{									\
-	return __lse_ll_sc_body(_cmpxchg_double##name, 			\
-				old1, old2, new1, new2, ptr);		\
-}
-
-__CMPXCHG_DBL(   )
-__CMPXCHG_DBL(_mb)
-
-#undef __CMPXCHG_DBL
-
 #define __CMPXCHG128(name)						\
 static inline u128 __cmpxchg128##name(volatile u128 *ptr,		\
 				      u128 old, u128 new)		\
@@ -212,36 +196,6 @@ __CMPXCHG_GEN(_mb)
 #define arch_cmpxchg64			arch_cmpxchg
 #define arch_cmpxchg64_local		arch_cmpxchg_local
 
-/* cmpxchg_double */
-#define system_has_cmpxchg_double()     1
-
-#define __cmpxchg_double_check(ptr1, ptr2)					\
-({										\
-	if (sizeof(*(ptr1)) != 8)						\
-		BUILD_BUG();							\
-	VM_BUG_ON((unsigned long *)(ptr2) - (unsigned long *)(ptr1) != 1);	\
-})
-
-#define arch_cmpxchg_double(ptr1, ptr2, o1, o2, n1, n2)				\
-({										\
-	int __ret;								\
-	__cmpxchg_double_check(ptr1, ptr2);					\
-	__ret = !__cmpxchg_double_mb((unsigned long)(o1), (unsigned long)(o2),	\
-				     (unsigned long)(n1), (unsigned long)(n2),	\
-				     ptr1);					\
-	__ret;									\
-})
-
-#define arch_cmpxchg_double_local(ptr1, ptr2, o1, o2, n1, n2)			\
-({										\
-	int __ret;								\
-	__cmpxchg_double_check(ptr1, ptr2);					\
-	__ret = !__cmpxchg_double((unsigned long)(o1), (unsigned long)(o2),	\
-				  (unsigned long)(n1), (unsigned long)(n2),	\
-				  ptr1);					\
-	__ret;									\
-})
-
 /* cmpxchg128 */
 #define system_has_cmpxchg128()		1
 
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -145,16 +145,6 @@ PERCPU_RET_OP(add, add, ldadd)
  * preemption point when TIF_NEED_RESCHED gets set while preemption is
  * disabled.
  */
-#define this_cpu_cmpxchg_double_8(ptr1, ptr2, o1, o2, n1, n2)		\
-({									\
-	int __ret;							\
-	preempt_disable_notrace();					\
-	__ret = cmpxchg_double_local(	raw_cpu_ptr(&(ptr1)),		\
-					raw_cpu_ptr(&(ptr2)),		\
-					o1, o2, n1, n2);		\
-	preempt_enable_notrace();					\
-	__ret;								\
-})
 
 #define _pcp_protect(op, pcp, ...)					\
 ({									\
--- a/arch/s390/include/asm/cmpxchg.h
+++ b/arch/s390/include/asm/cmpxchg.h
@@ -167,40 +167,6 @@ static __always_inline unsigned long __c
 #define arch_cmpxchg_local	arch_cmpxchg
 #define arch_cmpxchg64_local	arch_cmpxchg
 
-#define system_has_cmpxchg_double()	1
-
-static __always_inline int __cmpxchg_double(unsigned long p1, unsigned long p2,
-					    unsigned long o1, unsigned long o2,
-					    unsigned long n1, unsigned long n2)
-{
-	union register_pair old = { .even = o1, .odd = o2, };
-	union register_pair new = { .even = n1, .odd = n2, };
-	int cc;
-
-	asm volatile(
-		"	cdsg	%[old],%[new],%[ptr]\n"
-		"	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		: [cc] "=&d" (cc), [old] "+&d" (old.pair)
-		: [new] "d" (new.pair),
-		  [ptr] "QS" (*(unsigned long *)p1), "Q" (*(unsigned long *)p2)
-		: "memory", "cc");
-	return !cc;
-}
-
-#define arch_cmpxchg_double(p1, p2, o1, o2, n1, n2)			\
-({									\
-	typeof(p1) __p1 = (p1);						\
-	typeof(p2) __p2 = (p2);						\
-									\
-	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long));			\
-	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long));			\
-	VM_BUG_ON((unsigned long)((__p1) + 1) != (unsigned long)(__p2));\
-	__cmpxchg_double((unsigned long)__p1, (unsigned long)__p2,	\
-			 (unsigned long)(o1), (unsigned long)(o2),	\
-			 (unsigned long)(n1), (unsigned long)(n2));	\
-})
-
 #define system_has_cmpxchg128()		1
 
 static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -181,24 +181,6 @@
 #define this_cpu_xchg_4(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 #define this_cpu_xchg_8(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 
-#define arch_this_cpu_cmpxchg_double(pcp1, pcp2, o1, o2, n1, n2)	    \
-({									    \
-	typeof(pcp1) *p1__;						    \
-	typeof(pcp2) *p2__;						    \
-	int ret__;							    \
-									    \
-	preempt_disable_notrace();					    \
-	p1__ = raw_cpu_ptr(&(pcp1));					    \
-	p2__ = raw_cpu_ptr(&(pcp2));					    \
-	ret__ = __cmpxchg_double((unsigned long)p1__, (unsigned long)p2__,  \
-				 (unsigned long)(o1), (unsigned long)(o2),  \
-				 (unsigned long)(n1), (unsigned long)(n2)); \
-	preempt_enable_notrace();					    \
-	ret__;								    \
-})
-
-#define this_cpu_cmpxchg_double_8 arch_this_cpu_cmpxchg_double
-
 #include <asm-generic/percpu.h>
 
 #endif /* __ARCH_S390_PERCPU__ */
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -233,29 +233,4 @@ extern void __add_wrong_size(void)
 #define __xadd(ptr, inc, lock)	__xchg_op((ptr), (inc), xadd, lock)
 #define xadd(ptr, inc)		__xadd((ptr), (inc), LOCK_PREFIX)
 
-#define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2)			\
-({									\
-	bool __ret;							\
-	__typeof__(*(p1)) __old1 = (o1), __new1 = (n1);			\
-	__typeof__(*(p2)) __old2 = (o2), __new2 = (n2);			\
-	BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long));			\
-	BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long));			\
-	VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long)));		\
-	VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2));	\
-	asm volatile(pfx "cmpxchg%c5b %1"				\
-		     CC_SET(e)						\
-		     : CC_OUT(e) (__ret),				\
-		       "+m" (*(p1)), "+m" (*(p2)),			\
-		       "+a" (__old1), "+d" (__old2)			\
-		     : "i" (2 * sizeof(long)),				\
-		       "b" (__new1), "c" (__new2));			\
-	__ret;								\
-})
-
-#define arch_cmpxchg_double(p1, p2, o1, o2, n1, n2) \
-	__cmpxchg_double(LOCK_PREFIX, p1, p2, o1, o2, n1, n2)
-
-#define arch_cmpxchg_double_local(p1, p2, o1, o2, n1, n2) \
-	__cmpxchg_double(, p1, p2, o1, o2, n1, n2)
-
 #endif	/* ASM_X86_CMPXCHG_H */
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -103,7 +103,6 @@ static inline bool __try_cmpxchg64(volat
 
 #endif
 
-#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX8)
 #define system_has_cmpxchg64()		boot_cpu_has(X86_FEATURE_CX8)
 
 #endif /* _ASM_X86_CMPXCHG_32_H */
--- a/arch/x86/include/asm/cmpxchg_64.h
+++ b/arch/x86/include/asm/cmpxchg_64.h
@@ -72,7 +72,6 @@ static __always_inline bool arch_try_cmp
 	return likely(ret);
 }
 
-#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX16)
 #define system_has_cmpxchg128()		boot_cpu_has(X86_FEATURE_CX16)
 
 #endif /* _ASM_X86_CMPXCHG_64_H */
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -346,23 +346,6 @@ do {									\
 #define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
 #define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
 
-#ifdef CONFIG_X86_CMPXCHG64
-#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
-({									\
-	bool __ret;							\
-	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
-	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
-	asm volatile("cmpxchg8b "__percpu_arg(1)			\
-		     CC_SET(z)						\
-		     : CC_OUT(z) (__ret), "+m" (pcp1), "+m" (pcp2), "+a" (__o1), "+d" (__o2) \
-		     : "b" (__n1), "c" (__n2));				\
-	__ret;								\
-})
-
-#define raw_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
-#define this_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
-#endif /* CONFIG_X86_CMPXCHG64 */
-
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -385,30 +368,6 @@ do {									\
 #define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
 #define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
-
-/*
- * Pretty complex macro to generate cmpxchg16 instruction.  The instruction
- * is not supported on early AMD64 processors so we must be able to emulate
- * it in software.  The address used in the cmpxchg16 instruction must be
- * aligned to a 16 byte boundary.
- */
-#define percpu_cmpxchg16b_double(pcp1, pcp2, o1, o2, n1, n2)		\
-({									\
-	bool __ret;							\
-	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
-	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
-	alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
-		       "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t",	\
-		       X86_FEATURE_CX16,				\
-		       ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),		\
-				   "+m" (pcp2), "+d" (__o2)),		\
-		       "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");	\
-	__ret;								\
-})
-
-#define raw_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
-#define this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
-
 #endif
 
 static __always_inline bool x86_this_cpu_constant_test_bit(unsigned int nr,
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -99,19 +99,6 @@ do {									\
 	__ret;								\
 })
 
-#define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-({									\
-	typeof(pcp1) *__p1 = raw_cpu_ptr(&(pcp1));			\
-	typeof(pcp2) *__p2 = raw_cpu_ptr(&(pcp2));			\
-	int __ret = 0;							\
-	if (*__p1 == (oval1) && *__p2  == (oval2)) {			\
-		*__p1 = nval1;						\
-		*__p2 = nval2;						\
-		__ret = 1;						\
-	}								\
-	(__ret);							\
-})
-
 #define __this_cpu_generic_read_nopreempt(pcp)				\
 ({									\
 	typeof(pcp) ___ret;						\
@@ -180,17 +167,6 @@ do {									\
 	__ret;								\
 })
 
-#define this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-({									\
-	int __ret;							\
-	unsigned long __flags;						\
-	raw_local_irq_save(__flags);					\
-	__ret = raw_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
-			oval1, oval2, nval1, nval2);			\
-	raw_local_irq_restore(__flags);					\
-	__ret;								\
-})
-
 #ifndef raw_cpu_read_1
 #define raw_cpu_read_1(pcp)		raw_cpu_generic_read(pcp)
 #endif
@@ -303,23 +279,6 @@ do {									\
 	raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
-#ifndef raw_cpu_cmpxchg_double_1
-#define raw_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef raw_cpu_cmpxchg_double_2
-#define raw_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef raw_cpu_cmpxchg_double_4
-#define raw_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef raw_cpu_cmpxchg_double_8
-#define raw_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-
 #ifndef this_cpu_read_1
 #define this_cpu_read_1(pcp)		this_cpu_generic_read(pcp)
 #endif
@@ -432,21 +391,4 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
-#ifndef this_cpu_cmpxchg_double_1
-#define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef this_cpu_cmpxchg_double_2
-#define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef this_cpu_cmpxchg_double_4
-#define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef this_cpu_cmpxchg_double_8
-#define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-
 #endif /* _ASM_GENERIC_PERCPU_H_ */
--- a/include/linux/atomic/atomic-instrumented.h
+++ b/include/linux/atomic/atomic-instrumented.h
@@ -2141,21 +2141,6 @@ atomic_long_dec_if_positive(atomic_long_
 	arch_sync_cmpxchg(__ai_ptr, __VA_ARGS__); \
 })
 
-#define cmpxchg_double(ptr, ...) \
-({ \
-	typeof(ptr) __ai_ptr = (ptr); \
-	kcsan_mb(); \
-	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
-	arch_cmpxchg_double(__ai_ptr, __VA_ARGS__); \
-})
-
-
-#define cmpxchg_double_local(ptr, ...) \
-({ \
-	typeof(ptr) __ai_ptr = (ptr); \
-	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
-	arch_cmpxchg_double_local(__ai_ptr, __VA_ARGS__); \
-})
 
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
-// 27320c1ec2bf2878ecb9df3ea4816a7bc0c57a52
+// 416a741acbd4d28dbfa45f1b2a2c1b714454229f
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -359,33 +359,6 @@ static inline void __this_cpu_preempt_ch
 	pscr2_ret__;							\
 })
 
-/*
- * Special handling for cmpxchg_double.  cmpxchg_double is passed two
- * percpu variables.  The first has to be aligned to a double word
- * boundary and the second has to follow directly thereafter.
- * We enforce this on all architectures even if they don't support
- * a double cmpxchg instruction, since it's a cheap requirement, and it
- * avoids breaking the requirement for architectures with the instruction.
- */
-#define __pcpu_double_call_return_bool(stem, pcp1, pcp2, ...)		\
-({									\
-	bool pdcrb_ret__;						\
-	__verify_pcpu_ptr(&(pcp1));					\
-	BUILD_BUG_ON(sizeof(pcp1) != sizeof(pcp2));			\
-	VM_BUG_ON((unsigned long)(&(pcp1)) % (2 * sizeof(pcp1)));	\
-	VM_BUG_ON((unsigned long)(&(pcp2)) !=				\
-		  (unsigned long)(&(pcp1)) + sizeof(pcp1));		\
-	switch(sizeof(pcp1)) {						\
-	case 1: pdcrb_ret__ = stem##1(pcp1, pcp2, __VA_ARGS__); break;	\
-	case 2: pdcrb_ret__ = stem##2(pcp1, pcp2, __VA_ARGS__); break;	\
-	case 4: pdcrb_ret__ = stem##4(pcp1, pcp2, __VA_ARGS__); break;	\
-	case 8: pdcrb_ret__ = stem##8(pcp1, pcp2, __VA_ARGS__); break;	\
-	default:							\
-		__bad_size_call_parameter(); break;			\
-	}								\
-	pdcrb_ret__;							\
-})
-
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -442,9 +415,6 @@ do {									\
 #define raw_cpu_xchg(pcp, nval)		__pcpu_size_call_return2(raw_cpu_xchg_, pcp, nval)
 #define raw_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size16_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
-#define raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	__pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
-
 #define raw_cpu_sub(pcp, val)		raw_cpu_add(pcp, -(val))
 #define raw_cpu_inc(pcp)		raw_cpu_add(pcp, 1)
 #define raw_cpu_dec(pcp)		raw_cpu_sub(pcp, 1)
@@ -504,11 +474,6 @@ do {									\
 	raw_cpu_cmpxchg(pcp, oval, nval);				\
 })
 
-#define __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-({	__this_cpu_preempt_check("cmpxchg_double");			\
-	raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2);	\
-})
-
 #define __this_cpu_sub(pcp, val)	__this_cpu_add(pcp, -(typeof(pcp))(val))
 #define __this_cpu_inc(pcp)		__this_cpu_add(pcp, 1)
 #define __this_cpu_dec(pcp)		__this_cpu_sub(pcp, 1)
@@ -529,9 +494,6 @@ do {									\
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
-#define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
-
 #define this_cpu_sub(pcp, val)		this_cpu_add(pcp, -(typeof(pcp))(val))
 #define this_cpu_inc(pcp)		this_cpu_add(pcp, 1)
 #define this_cpu_dec(pcp)		this_cpu_sub(pcp, 1)
--- a/scripts/atomic/gen-atomic-instrumented.sh
+++ b/scripts/atomic/gen-atomic-instrumented.sh
@@ -84,7 +84,6 @@ gen_xchg()
 {
 	local xchg="$1"; shift
 	local order="$1"; shift
-	local mult="$1"; shift
 
 	kcsan_barrier=""
 	if [ "${xchg%_local}" = "${xchg}" ]; then
@@ -104,8 +103,8 @@ cat <<EOF
 EOF
 [ -n "$kcsan_barrier" ] && printf "\t${kcsan_barrier}; \\\\\n"
 cat <<EOF
-	instrument_atomic_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
-	instrument_atomic_write(__ai_oldp, ${mult}sizeof(*__ai_oldp)); \\
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \\
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \\
 	arch_${xchg}${order}(__ai_ptr, __ai_oldp, __VA_ARGS__); \\
 })
 EOF
@@ -119,7 +118,7 @@ cat <<EOF
 EOF
 [ -n "$kcsan_barrier" ] && printf "\t${kcsan_barrier}; \\\\\n"
 cat <<EOF
-	instrument_atomic_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \\
 	arch_${xchg}${order}(__ai_ptr, __VA_ARGS__); \\
 })
 EOF
@@ -168,22 +167,16 @@ done
 
 for xchg in "xchg" "cmpxchg" "cmpxchg64" "cmpxchg128" "try_cmpxchg" "try_cmpxchg64" "try_cmpxchg128"; do
 	for order in "" "_acquire" "_release" "_relaxed"; do
-		gen_xchg "${xchg}" "${order}" ""
+		gen_xchg "${xchg}" "${order}"
 		printf "\n"
 	done
 done
 
 for xchg in "cmpxchg_local" "cmpxchg64_local" "cmpxchg128_local" "sync_cmpxchg"; do
-	gen_xchg "${xchg}" "" ""
+	gen_xchg "${xchg}" ""
 	printf "\n"
 done
 
-gen_xchg "cmpxchg_double" "" "2 * "
-
-printf "\n\n"
-
-gen_xchg "cmpxchg_double_local" "" "2 * "
-
 cat <<EOF
 
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (8 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 09/10] arch: Remove cmpxchg_double Peter Zijlstra
@ 2023-02-02 14:50 ` Peter Zijlstra
  2023-02-02 17:05   ` Heiko Carstens
  2023-02-02 19:39 ` [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Linus Torvalds
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-02 14:50 UTC (permalink / raw)
  To: torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, hpa, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

Now that there is a cross arch u128 and cmpxchg128(), use those
instead of the custom CDSG helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/s390/include/asm/cpu_mf.h  |    2 +-
 arch/s390/kernel/perf_cpum_sf.c |   22 ++++++----------------
 2 files changed, 7 insertions(+), 17 deletions(-)
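
The hunks below all follow the same compare-and-swap retry shape; a hedged distillation with one representative field update, assuming union hws_trailer_header as shown in the cpu_mf.h hunk:

#include <linux/atomic.h>
#include <asm/cpu_mf.h>		/* union hws_trailer_header, with a u128 .val */

/*
 * Hedged distillation of the pattern used below: read the 16-byte trailer
 * header atomically, modify a copy, and retry the compare-and-swap until
 * nobody raced with the update.
 */
static void reset_overflow(union hws_trailer_header *hdr)
{
	union hws_trailer_header old, new, prev;

	/* cmpxchg128(ptr, 0, 0) doubles as an atomic 16-byte load here. */
	prev.val = cmpxchg128(&hdr->val, 0, 0);
	do {
		old.val = prev.val;
		new.val = prev.val;
		new.overflow = 0;	/* the field updates vary per caller */
		prev.val = cmpxchg128(&hdr->val, old.val, new.val);
	} while (prev.val != old.val);	/* raced: retry with the fresh value */
}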

--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -141,7 +141,7 @@ union hws_trailer_header {
 		unsigned int dsdes:16;	/* 48-63: size of diagnostic SDE */
 		unsigned long long overflow; /* 64 - Overflow Count   */
 	};
-	__uint128_t val;
+	u128 val;
 };
 
 struct hws_trailer_entry {
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -1228,16 +1228,6 @@ static void hw_collect_samples(struct pe
 	}
 }
 
-static inline __uint128_t __cdsg(__uint128_t *ptr, __uint128_t old, __uint128_t new)
-{
-	asm volatile(
-		"	cdsg	%[old],%[new],%[ptr]\n"
-		: [old] "+d" (old), [ptr] "+QS" (*ptr)
-		: [new] "d" (new)
-		: "memory", "cc");
-	return old;
-}
-
 /* hw_perf_event_update() - Process sampling buffer
  * @event:	The perf event
  * @flush_all:	Flag to also flush partially filled sample-data-blocks
@@ -1307,14 +1297,14 @@ static void hw_perf_event_update(struct
 
 		/* Reset trailer (using compare-double-and-swap) */
 		/* READ_ONCE() 16 byte header */
-		prev.val = __cdsg(&te->header.val, 0, 0);
+		prev.val = cmpxchg128(&te->header.val, 0, 0);
 		do {
 			old.val = prev.val;
 			new.val = prev.val;
 			new.f = 0;
 			new.a = 1;
 			new.overflow = 0;
-			prev.val = __cdsg(&te->header.val, old.val, new.val);
+			prev.val = cmpxchg128(&te->header.val, old.val, new.val);
 		} while (prev.val != old.val);
 
 		/* Advance to next sample-data-block */
@@ -1496,7 +1486,7 @@ static bool aux_set_alert(struct aux_buf
 
 	te = aux_sdb_trailer(aux, alert_index);
 	/* READ_ONCE() 16 byte header */
-	prev.val = __cdsg(&te->header.val, 0, 0);
+	prev.val = cmpxchg128(&te->header.val, 0, 0);
 	do {
 		old.val = prev.val;
 		new.val = prev.val;
@@ -1511,7 +1501,7 @@ static bool aux_set_alert(struct aux_buf
 		}
 		new.a = 1;
 		new.overflow = 0;
-		prev.val = __cdsg(&te->header.val, old.val, new.val);
+		prev.val = cmpxchg128(&te->header.val, old.val, new.val);
 	} while (prev.val != old.val);
 	return true;
 }
@@ -1575,7 +1565,7 @@ static bool aux_reset_buffer(struct aux_
 	for (i = 0; i < range_scan; i++, idx++) {
 		te = aux_sdb_trailer(aux, idx);
 		/* READ_ONCE() 16 byte header */
-		prev.val = __cdsg(&te->header.val, 0, 0);
+		prev.val = cmpxchg128(&te->header.val, 0, 0);
 		do {
 			old.val = prev.val;
 			new.val = prev.val;
@@ -1586,7 +1576,7 @@ static bool aux_reset_buffer(struct aux_
 				new.a = 1;
 			else
 				new.a = 0;
-			prev.val = __cdsg(&te->header.val, old.val, new.val);
+			prev.val = cmpxchg128(&te->header.val, old.val, new.val);
 		} while (prev.val != old.val);
 		*overflow += orig_overflow;
 	}



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/10] arch: Introduce arch_{,try_}_cmpxchg128{,_local}()
  2023-02-02 14:50 ` [PATCH v2 03/10] arch: Introduce arch_{,try_}_cmpxchg128{,_local}() Peter Zijlstra
@ 2023-02-02 17:04   ` Heiko Carstens
  2023-02-03 16:52   ` Mark Rutland
  1 sibling, 0 replies; 31+ messages in thread
From: Heiko Carstens @ 2023-02-02 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, Herbert Xu, davem, penberg, rientjes,
	iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:33PM +0100, Peter Zijlstra wrote:
> For all architectures that currently support cmpxchg_double()
> implement the cmpxchg128() family of functions that is basically the
> same but with a saner interface.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/arm64/include/asm/atomic_ll_sc.h |   41 +++++++++++++++++++++++++
>  arch/arm64/include/asm/atomic_lse.h   |   31 +++++++++++++++++++
>  arch/arm64/include/asm/cmpxchg.h      |   26 ++++++++++++++++
>  arch/s390/include/asm/cmpxchg.h       |   14 ++++++++
>  arch/x86/include/asm/cmpxchg_32.h     |    3 +
>  arch/x86/include/asm/cmpxchg_64.h     |   55 +++++++++++++++++++++++++++++++++-
>  6 files changed, 168 insertions(+), 2 deletions(-)

For s390:
Acked-by: Heiko Carstens <hca@linux.ibm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
@ 2023-02-02 17:05   ` Heiko Carstens
  2023-02-03 17:02   ` Mark Rutland
  2023-02-03 17:25   ` Arnd Bergmann
  2 siblings, 0 replies; 31+ messages in thread
From: Heiko Carstens @ 2023-02-02 17:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, Herbert Xu, davem, penberg, rientjes,
	iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:35PM +0100, Peter Zijlstra wrote:
> In order to replace cmpxchg_double() with the newly minted
> cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/arm64/include/asm/percpu.h |   21 +++++++++++++++
>  arch/s390/include/asm/percpu.h  |   17 ++++++++++++
>  arch/x86/include/asm/percpu.h   |   56 ++++++++++++++++++++++++++++++++++++++++
>  include/asm-generic/percpu.h    |    8 +++++
>  include/linux/percpu-defs.h     |   20 ++++++++++++--
>  5 files changed, 120 insertions(+), 2 deletions(-)

For s390:
Acked-by: Heiko Carstens <hca@linux.ibm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128()
  2023-02-02 14:50 ` [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128() Peter Zijlstra
@ 2023-02-02 17:05   ` Heiko Carstens
  0 siblings, 0 replies; 31+ messages in thread
From: Heiko Carstens @ 2023-02-02 17:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, Herbert Xu, davem, penberg, rientjes,
	iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:40PM +0100, Peter Zijlstra wrote:
> Now that there is a cross arch u128 and cmpxchg128(), use those
> instead of the custom CDSG helper.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/s390/include/asm/cpu_mf.h  |    2 +-
>  arch/s390/kernel/perf_cpum_sf.c |   22 ++++++----------------
>  2 files changed, 7 insertions(+), 17 deletions(-)

Acked-by: Heiko Carstens <hca@linux.ibm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
  2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
                   ` (9 preceding siblings ...)
  2023-02-02 14:50 ` [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128() Peter Zijlstra
@ 2023-02-02 19:39 ` Linus Torvalds
  2023-02-02 22:45   ` David Laight
  10 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2023-02-02 19:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: corbet, will, boqun.feng, mark.rutland, catalin.marinas, dennis,
	tj, cl, hca, gor, agordeev, borntraeger, svens, tglx, mingo, bp,
	dave.hansen, x86, hpa, joro, suravee.suthikulpanit, robin.murphy,
	dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem, penberg,
	rientjes, iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin,
	42.hyeyoo, linux-doc, linux-kernel, linux-mm, linux-s390, iommu,
	linux-arch, linux-crypto

On Thu, Feb 2, 2023 at 7:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>  - fixed up the inline asm to use 'u128 *' mem argument so the compiler knows
>    how wide the modification is.
>  - reworked the percpu thing to use union based type-punning instead of
>    _Generic() based casts.

Looks lovely to me. This removed all my concerns (except for the
testing one, but all the patches looked nice and clean to me, so
clearly it must be perfect).

                Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 01/10] cyrpto/b128ops: Remove struct u128
  2023-02-02 14:50 ` [PATCH v2 01/10] cyrpto/b128ops: Remove struct u128 Peter Zijlstra
@ 2023-02-02 20:21   ` H. Peter Anvin
  2023-02-03  1:06   ` Herbert Xu
  1 sibling, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2023-02-02 20:21 UTC (permalink / raw)
  To: Peter Zijlstra, torvalds
  Cc: corbet, will, peterz, boqun.feng, mark.rutland, catalin.marinas,
	dennis, tj, cl, hca, gor, agordeev, borntraeger, svens, tglx,
	mingo, bp, dave.hansen, x86, joro, suravee.suthikulpanit,
	robin.murphy, dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem,
	penberg, rientjes, iamjoonsoo.kim, Andrew Morton, vbabka,
	roman.gushchin, 42.hyeyoo, linux-doc, linux-kernel, linux-mm,
	linux-s390, iommu, linux-arch, linux-crypto

On February 2, 2023 6:50:31 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>Per git-grep u128_xor() and its related struct u128 are unused except
>to implement {be,le}128_xor(). Remove them to free up the namespace.
>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>---
> include/crypto/b128ops.h |   14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
>--- a/include/crypto/b128ops.h
>+++ b/include/crypto/b128ops.h
>@@ -50,10 +50,6 @@
> #include <linux/types.h>
> 
> typedef struct {
>-	u64 a, b;
>-} u128;
>-
>-typedef struct {
> 	__be64 a, b;
> } be128;
> 
>@@ -61,20 +57,16 @@ typedef struct {
> 	__le64 b, a;
> } le128;
> 
>-static inline void u128_xor(u128 *r, const u128 *p, const u128 *q)
>+static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
> {
> 	r->a = p->a ^ q->a;
> 	r->b = p->b ^ q->b;
> }
> 
>-static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
>-{
>-	u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
>-}
>-
> static inline void le128_xor(le128 *r, const le128 *p, const le128 *q)
> {
>-	u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
>+	r->a = p->a ^ q->a;
>+	r->b = p->b ^ q->b;
> }
> 
> #endif /* _CRYPTO_B128OPS_H */
>
>

Can we centralize these ordered types, too?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
  2023-02-02 19:39 ` [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Linus Torvalds
@ 2023-02-02 22:45   ` David Laight
  0 siblings, 0 replies; 31+ messages in thread
From: David Laight @ 2023-02-02 22:45 UTC (permalink / raw)
  To: 'Linus Torvalds', Peter Zijlstra
  Cc: corbet, will, boqun.feng, mark.rutland, catalin.marinas, dennis,
	tj, cl, hca, gor, agordeev, borntraeger, svens, tglx, mingo, bp,
	dave.hansen, x86, hpa, joro, suravee.suthikulpanit, robin.murphy,
	dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem, penberg,
	rientjes, iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin,
	42.hyeyoo, linux-doc, linux-kernel, linux-mm, linux-s390, iommu,
	linux-arch, linux-crypto

From: Linus Torvalds
> Sent: 02 February 2023 19:39
> 
> On Thu, Feb 2, 2023 at 7:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >  - fixed up the inline asm to use 'u128 *' mem argument so the compiler knows
> >    how wide the modification is.
> >  - reworked the percpu thing to use union based type-punning instead of
> >    _Generic() based casts.
> 
> Looks lovely to me. This removed all my concerns (except for the
> testing one, but all the patches looked nice and clean to me, so
> clearly it must be perfect).

The change is almost certainly for the better.

But did I spot one of the bits using cmpxchg128 just to do an atomic write?
I think it was updating some interrupt info that was at first glance not
dissimilar to that used by MSI-X (it wasn't MSI-X).

If that was a hardware register then it could well require a full bus lock.
Using a write of an SSE (or equivalent) 128-bit register would be an atomic
write without the bus lock problem.

Also, that is only going to work if the hardware/logic side guarantees to
treat a single write as atomic.
I know there are MSI-X implementations out there where the CPU write
will be split into four 32-bit writes to some internal memory and the
hardware side will also do multiple accesses.
(Pretty much any implementation on an FPGA will behave like that,
not just the one I wrote.)
I didn't see the MSI-X code there, but I do wonder how it safely changes
affinities.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
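
The pattern in question -- a 128-bit store forced through the compare-and-exchange primitive -- looks roughly like the following hedged sketch (not the actual IOMMU/IRTE code):

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Hedged sketch of a 16-byte "atomic write" built from try_cmpxchg128();
 * this is the usage pattern being questioned, not the actual IOMMU code.
 */
static inline void store128(u128 *p, u128 val)
{
	u128 old = *p;	/* plain read; may be stale, hence the retry loop */

	do {
		/* try_cmpxchg128() refreshes 'old' with the current value on failure */
	} while (!try_cmpxchg128(p, &old, val));
}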

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 01/10] cyrpto/b128ops: Remove struct u128
  2023-02-02 14:50 ` [PATCH v2 01/10] cyrpto/b128ops: Remove struct u128 Peter Zijlstra
  2023-02-02 20:21   ` H. Peter Anvin
@ 2023-02-03  1:06   ` Herbert Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Herbert Xu @ 2023-02-03  1:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, hca, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, davem, penberg, rientjes, iamjoonsoo.kim,
	Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo, linux-doc,
	linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:31PM +0100, Peter Zijlstra wrote:
> Per git-grep u128_xor() and its related struct u128 are unused except
> to implement {be,le}128_xor(). Remove them to free up the namespace.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/crypto/b128ops.h |   14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/10] x86,intel_iommu: Replace cmpxchg_double()
  2023-02-02 14:50 ` [PATCH v2 07/10] x86,intel_iommu: " Peter Zijlstra
@ 2023-02-03  2:51   ` Baolu Lu
  0 siblings, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2023-02-03  2:51 UTC (permalink / raw)
  To: Peter Zijlstra, torvalds
  Cc: baolu.lu, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, hca, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, Arnd Bergmann,
	Herbert Xu, davem, penberg, rientjes, iamjoonsoo.kim,
	Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo, linux-doc,
	linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On 2023/2/2 22:50, Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
> ---
>   drivers/iommu/intel/irq_remapping.c |    8 --
>   include/linux/dmar.h                |  125 +++++++++++++++++++-----------------
>   2 files changed, 68 insertions(+), 65 deletions(-)

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 02/10] types: Introduce [us]128
  2023-02-02 14:50 ` [PATCH v2 02/10] types: Introduce [us]128 Peter Zijlstra
@ 2023-02-03  3:37   ` Herbert Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Herbert Xu @ 2023-02-03  3:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, hca, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, davem, penberg, rientjes, iamjoonsoo.kim,
	Andrew Morton, vbabka, roman.gushchin, 42.hyeyoo, linux-doc,
	linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:32PM +0100, Peter Zijlstra wrote:
> Introduce [us]128 (when available). Unlike [us]64, ensure they are
> always naturally aligned.
> 
> This also enables 128bit wide atomics (which require natural
> alignment) such as cmpxchg128().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/types.h      |    5 +++++
>  include/uapi/linux/types.h |    4 ++++
>  2 files changed, 9 insertions(+)

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
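
For context, "naturally aligned" here boils down to an explicit 16-byte alignment on the 128-bit typedefs; a hedged sketch of their shape (the real definitions are in the patch and may differ in spelling):

/*
 * Hedged sketch only -- see the patch's include/linux/types.h and
 * include/uapi/linux/types.h changes for the real definitions.  The
 * explicit 16-byte alignment is what makes 128-bit wide atomics such as
 * cmpxchg128() legal on these types.
 */
#ifdef __SIZEOF_INT128__
typedef __int128 s128 __attribute__((aligned(16)));
typedef unsigned __int128 u128 __attribute__((aligned(16)));
#endif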

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 03/10] arch: Introduce arch_{,try_}_cmpxchg128{,_local}()
  2023-02-02 14:50 ` [PATCH v2 03/10] arch: Introduce arch_{,try_}_cmpxchg128{,_local}() Peter Zijlstra
  2023-02-02 17:04   ` Heiko Carstens
@ 2023-02-03 16:52   ` Mark Rutland
  1 sibling, 0 replies; 31+ messages in thread
From: Mark Rutland @ 2023-02-03 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, catalin.marinas, dennis, tj,
	cl, hca, gor, agordeev, borntraeger, svens, tglx, mingo, bp,
	dave.hansen, x86, hpa, joro, suravee.suthikulpanit, robin.murphy,
	dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem, penberg,
	rientjes, iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin,
	42.hyeyoo, linux-doc, linux-kernel, linux-mm, linux-s390, iommu,
	linux-arch, linux-crypto

On Thu, Feb 02, 2023 at 03:50:33PM +0100, Peter Zijlstra wrote:
> For all architectures that currently support cmpxchg_double()
> implement the cmpxchg128() family of functions that is basically the
> same but with a saner interface.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

For arm64:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.
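
To illustrate the "saner interface": cmpxchg_double() took six arguments across two adjacent words and returned 1 on success, while cmpxchg128() takes one naturally aligned u128 and returns the previous value. A hedged example caller, not code from the patch:

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative caller only; 'val' is assumed to point at a naturally
 * aligned u128 overlaying the two words the old API operated on. */
static inline bool update_obj(u128 *val, u128 old, u128 new)
{
	/*
	 * Before: cmpxchg_double(p1, p2, o1, o2, n1, n2) returned 1 on success.
	 * After:  cmpxchg128(ptr, old, new) returns the previous 128-bit value.
	 */
	return cmpxchg128(val, old, new) == old;
}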

> ---
>  arch/arm64/include/asm/atomic_ll_sc.h |   41 +++++++++++++++++++++++++
>  arch/arm64/include/asm/atomic_lse.h   |   31 +++++++++++++++++++
>  arch/arm64/include/asm/cmpxchg.h      |   26 ++++++++++++++++
>  arch/s390/include/asm/cmpxchg.h       |   14 ++++++++
>  arch/x86/include/asm/cmpxchg_32.h     |    3 +
>  arch/x86/include/asm/cmpxchg_64.h     |   55 +++++++++++++++++++++++++++++++++-
>  6 files changed, 168 insertions(+), 2 deletions(-)
> 
> --- a/arch/arm64/include/asm/atomic_ll_sc.h
> +++ b/arch/arm64/include/asm/atomic_ll_sc.h
> @@ -326,6 +326,47 @@ __CMPXCHG_DBL(   ,        ,  ,         )
>  __CMPXCHG_DBL(_mb, dmb ish, l, "memory")
>  
>  #undef __CMPXCHG_DBL
> +
> +union __u128_halves {
> +	u128 full;
> +	struct {
> +		u64 low, high;
> +	};
> +};
> +
> +#define __CMPXCHG128(name, mb, rel, cl...)                             \
> +static __always_inline u128						\
> +__ll_sc__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)	\
> +{									\
> +	union __u128_halves r, o = { .full = (old) },			\
> +			       n = { .full = (new) };			\
> +       unsigned int tmp;                                               \
> +									\
> +	asm volatile("// __cmpxchg128" #name "\n"			\
> +       "       prfm    pstl1strm, %[v]\n"                              \
> +       "1:     ldxp    %[rl], %[rh], %[v]\n"                           \
> +       "       cmp     %[rl], %[ol]\n"                                 \
> +       "       ccmp    %[rh], %[oh], 0, eq\n"                          \
> +       "       b.ne    2f\n"                                           \
> +       "       st" #rel "xp    %w[tmp], %[nl], %[nh], %[v]\n"          \
> +       "       cbnz    %w[tmp], 1b\n"                                  \
> +	"	" #mb "\n"						\
> +	"2:"								\
> +       : [v] "+Q" (*(u128 *)ptr),                                      \
> +         [rl] "=&r" (r.low), [rh] "=&r" (r.high),                      \
> +         [tmp] "=&r" (tmp)                                             \
> +       : [ol] "r" (o.low), [oh] "r" (o.high),                          \
> +         [nl] "r" (n.low), [nh] "r" (n.high)                           \
> +       : "cc", ##cl);                                                  \
> +									\
> +	return r.full;							\
> +}
> +
> +__CMPXCHG128(   ,        ,  )
> +__CMPXCHG128(_mb, dmb ish, l, "memory")
> +
> +#undef __CMPXCHG128
> +
>  #undef K
>  
>  #endif	/* __ASM_ATOMIC_LL_SC_H */
> --- a/arch/arm64/include/asm/atomic_lse.h
> +++ b/arch/arm64/include/asm/atomic_lse.h
> @@ -324,4 +324,35 @@ __CMPXCHG_DBL(_mb, al, "memory")
>  
>  #undef __CMPXCHG_DBL
>  
> +#define __CMPXCHG128(name, mb, cl...)					\
> +static __always_inline u128						\
> +__lse__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)		\
> +{									\
> +	union __u128_halves r, o = { .full = (old) },			\
> +			       n = { .full = (new) };			\
> +	register unsigned long x0 asm ("x0") = o.low;			\
> +	register unsigned long x1 asm ("x1") = o.high;			\
> +	register unsigned long x2 asm ("x2") = n.low;			\
> +	register unsigned long x3 asm ("x3") = n.high;			\
> +	register unsigned long x4 asm ("x4") = (unsigned long)ptr;	\
> +									\
> +	asm volatile(							\
> +	__LSE_PREAMBLE							\
> +	"	casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
> +	: [old1] "+&r" (x0), [old2] "+&r" (x1),				\
> +	  [v] "+Q" (*(u128 *)ptr)					\
> +	: [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),		\
> +	  [oldval1] "r" (o.low), [oldval2] "r" (o.high)			\
> +	: cl);								\
> +									\
> +	r.low = x0; r.high = x1;					\
> +									\
> +	return r.full;							\
> +}
> +
> +__CMPXCHG128(   ,   )
> +__CMPXCHG128(_mb, al, "memory")
> +
> +#undef __CMPXCHG128
> +
>  #endif	/* __ASM_ATOMIC_LSE_H */
> --- a/arch/arm64/include/asm/cmpxchg.h
> +++ b/arch/arm64/include/asm/cmpxchg.h
> @@ -147,6 +147,19 @@ __CMPXCHG_DBL(_mb)
>  
>  #undef __CMPXCHG_DBL
>  
> +#define __CMPXCHG128(name)						\
> +static inline u128 __cmpxchg128##name(volatile u128 *ptr,		\
> +				      u128 old, u128 new)		\
> +{									\
> +	return __lse_ll_sc_body(_cmpxchg128##name,			\
> +				ptr, old, new);				\
> +}
> +
> +__CMPXCHG128(   )
> +__CMPXCHG128(_mb)
> +
> +#undef __CMPXCHG128
> +
>  #define __CMPXCHG_GEN(sfx)						\
>  static __always_inline unsigned long __cmpxchg##sfx(volatile void *ptr,	\
>  					   unsigned long old,		\
> @@ -229,6 +242,19 @@ __CMPXCHG_GEN(_mb)
>  	__ret;									\
>  })
>  
> +/* cmpxchg128 */
> +#define system_has_cmpxchg128()		1
> +
> +#define arch_cmpxchg128(ptr, o, n)						\
> +({										\
> +	__cmpxchg128_mb((ptr), (o), (n));					\
> +})
> +
> +#define arch_cmpxchg128_local(ptr, o, n)					\
> +({										\
> +	__cmpxchg128((ptr), (o), (n));						\
> +})
> +
>  #define __CMPWAIT_CASE(w, sfx, sz)					\
>  static inline void __cmpwait_case_##sz(volatile void *ptr,		\
>  				       unsigned long val)		\
> --- a/arch/s390/include/asm/cmpxchg.h
> +++ b/arch/s390/include/asm/cmpxchg.h
> @@ -201,4 +201,18 @@ static __always_inline int __cmpxchg_dou
>  			 (unsigned long)(n1), (unsigned long)(n2));	\
>  })
>  
> +#define system_has_cmpxchg128()		1
> +
> +static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
> +{
> +	asm volatile(
> +		"	cdsg	%[old],%[new],%[ptr]\n"
> +		: [old] "+d" (old), [ptr] "+QS" (*ptr)
> +		: [new] "d" (new)
> +		: "memory", "cc");
> +	return old;
> +}
> +
> +#define arch_cmpxchg128		arch_cmpxchg128
> +
>  #endif /* __ASM_CMPXCHG_H */
> --- a/arch/x86/include/asm/cmpxchg_32.h
> +++ b/arch/x86/include/asm/cmpxchg_32.h
> @@ -103,6 +103,7 @@ static inline bool __try_cmpxchg64(volat
>  
>  #endif
>  
> -#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX8)
> +#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX8)
> +#define system_has_cmpxchg64()		boot_cpu_has(X86_FEATURE_CX8)
>  
>  #endif /* _ASM_X86_CMPXCHG_32_H */
> --- a/arch/x86/include/asm/cmpxchg_64.h
> +++ b/arch/x86/include/asm/cmpxchg_64.h
> @@ -20,6 +20,59 @@
>  	arch_try_cmpxchg((ptr), (po), (n));				\
>  })
>  
> -#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX16)
> +union __u128_halves {
> +	u128 full;
> +	struct {
> +		u64 low, high;
> +	};
> +};
> +
> +static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
> +{
> +	union __u128_halves o = { .full = old, }, n = { .full = new, };
> +
> +	asm volatile(LOCK_PREFIX "cmpxchg16b %[ptr]"
> +		     : [ptr] "+m" (*ptr),
> +		       "+a" (o.low), "+d" (o.high)
> +		     : "b" (n.low), "c" (n.high)
> +		     : "memory");
> +
> +	return o.full;
> +}
> +
> +static __always_inline u128 arch_cmpxchg128_local(volatile u128 *ptr, u128 old, u128 new)
> +{
> +	union __u128_halves o = { .full = old, }, n = { .full = new, };
> +
> +	asm volatile("cmpxchg16b %[ptr]"
> +		     : [ptr] "+m" (*ptr),
> +		       "+a" (o.low), "+d" (o.high)
> +		     : "b" (n.low), "c" (n.high)
> +		     : "memory");
> +
> +	return o.full;
> +}
> +
> +static __always_inline bool arch_try_cmpxchg128(volatile u128 *ptr, u128 *old, u128 new)
> +{
> +	union __u128_halves o = { .full = *old, }, n = { .full = new, };
> +	bool ret;
> +
> +	asm volatile(LOCK_PREFIX "cmpxchg16b %[ptr]"
> +		     CC_SET(e)
> +		     : CC_OUT(e) (ret),
> +		       [ptr] "+m" (*ptr),
> +		       "+a" (o.low), "+d" (o.high)
> +		     : "b" (n.low), "c" (n.high)
> +		     : "memory");
> +
> +	if (unlikely(!ret))
> +		*old = o.full;
> +
> +	return likely(ret);
> +}
> +
> +#define system_has_cmpxchg_double()	boot_cpu_has(X86_FEATURE_CX16)
> +#define system_has_cmpxchg128()		boot_cpu_has(X86_FEATURE_CX16)
>  
>  #endif /* _ASM_X86_CMPXCHG_64_H */
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 04/10] instrumentation: Wire up cmpxchg128()
  2023-02-02 14:50 ` [PATCH v2 04/10] instrumentation: Wire up cmpxchg128() Peter Zijlstra
@ 2023-02-03 16:55   ` Mark Rutland
  0 siblings, 0 replies; 31+ messages in thread
From: Mark Rutland @ 2023-02-03 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, catalin.marinas, dennis, tj,
	cl, hca, gor, agordeev, borntraeger, svens, tglx, mingo, bp,
	dave.hansen, x86, hpa, joro, suravee.suthikulpanit, robin.murphy,
	dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem, penberg,
	rientjes, iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin,
	42.hyeyoo, linux-doc, linux-kernel, linux-mm, linux-s390, iommu,
	linux-arch, linux-crypto

On Thu, Feb 02, 2023 at 03:50:34PM +0100, Peter Zijlstra wrote:
> Wire up the cmpxchg128 familty in the atomic wrappery scripts.

s/familty/family/ (and s/wrappery/wrapper/ ?)

> 
> These provide the generic cmpxchg128 family of functions from the
> arch_ prefixed version, adding explicit instrumentation where needed.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  include/linux/atomic/atomic-arch-fallback.h |   95 +++++++++++++++++++++++++++-
>  include/linux/atomic/atomic-instrumented.h  |   77 ++++++++++++++++++++++
>  scripts/atomic/gen-atomic-fallback.sh       |    4 -
>  scripts/atomic/gen-atomic-instrumented.sh   |    4 -
>  4 files changed, 174 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/atomic/atomic-arch-fallback.h
> +++ b/include/linux/atomic/atomic-arch-fallback.h
> @@ -77,6 +77,29 @@
>  
>  #endif /* arch_cmpxchg64_relaxed */
>  
> +#ifndef arch_cmpxchg128_relaxed
> +#define arch_cmpxchg128_acquire arch_cmpxchg128
> +#define arch_cmpxchg128_release arch_cmpxchg128
> +#define arch_cmpxchg128_relaxed arch_cmpxchg128
> +#else /* arch_cmpxchg128_relaxed */
> +
> +#ifndef arch_cmpxchg128_acquire
> +#define arch_cmpxchg128_acquire(...) \
> +	__atomic_op_acquire(arch_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#ifndef arch_cmpxchg128_release
> +#define arch_cmpxchg128_release(...) \
> +	__atomic_op_release(arch_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#ifndef arch_cmpxchg128
> +#define arch_cmpxchg128(...) \
> +	__atomic_op_fence(arch_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#endif /* arch_cmpxchg128_relaxed */
> +
>  #ifndef arch_try_cmpxchg_relaxed
>  #ifdef arch_try_cmpxchg
>  #define arch_try_cmpxchg_acquire arch_try_cmpxchg
> @@ -217,6 +240,76 @@
>  
>  #endif /* arch_try_cmpxchg64_relaxed */
>  
> +#ifndef arch_try_cmpxchg128_relaxed
> +#ifdef arch_try_cmpxchg128
> +#define arch_try_cmpxchg128_acquire arch_try_cmpxchg128
> +#define arch_try_cmpxchg128_release arch_try_cmpxchg128
> +#define arch_try_cmpxchg128_relaxed arch_try_cmpxchg128
> +#endif /* arch_try_cmpxchg128 */
> +
> +#ifndef arch_try_cmpxchg128
> +#define arch_try_cmpxchg128(_ptr, _oldp, _new) \
> +({ \
> +	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> +	___r = arch_cmpxchg128((_ptr), ___o, (_new)); \
> +	if (unlikely(___r != ___o)) \
> +		*___op = ___r; \
> +	likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg128 */
> +
> +#ifndef arch_try_cmpxchg128_acquire
> +#define arch_try_cmpxchg128_acquire(_ptr, _oldp, _new) \
> +({ \
> +	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> +	___r = arch_cmpxchg128_acquire((_ptr), ___o, (_new)); \
> +	if (unlikely(___r != ___o)) \
> +		*___op = ___r; \
> +	likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg128_acquire */
> +
> +#ifndef arch_try_cmpxchg128_release
> +#define arch_try_cmpxchg128_release(_ptr, _oldp, _new) \
> +({ \
> +	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> +	___r = arch_cmpxchg128_release((_ptr), ___o, (_new)); \
> +	if (unlikely(___r != ___o)) \
> +		*___op = ___r; \
> +	likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg128_release */
> +
> +#ifndef arch_try_cmpxchg128_relaxed
> +#define arch_try_cmpxchg128_relaxed(_ptr, _oldp, _new) \
> +({ \
> +	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> +	___r = arch_cmpxchg128_relaxed((_ptr), ___o, (_new)); \
> +	if (unlikely(___r != ___o)) \
> +		*___op = ___r; \
> +	likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg128_relaxed */
> +
> +#else /* arch_try_cmpxchg128_relaxed */
> +
> +#ifndef arch_try_cmpxchg128_acquire
> +#define arch_try_cmpxchg128_acquire(...) \
> +	__atomic_op_acquire(arch_try_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#ifndef arch_try_cmpxchg128_release
> +#define arch_try_cmpxchg128_release(...) \
> +	__atomic_op_release(arch_try_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#ifndef arch_try_cmpxchg128
> +#define arch_try_cmpxchg128(...) \
> +	__atomic_op_fence(arch_try_cmpxchg128, __VA_ARGS__)
> +#endif
> +
> +#endif /* arch_try_cmpxchg128_relaxed */
> +
>  #ifndef arch_atomic_read_acquire
>  static __always_inline int
>  arch_atomic_read_acquire(const atomic_t *v)
> @@ -2456,4 +2549,4 @@ arch_atomic64_dec_if_positive(atomic64_t
>  #endif
>  
>  #endif /* _LINUX_ATOMIC_FALLBACK_H */
> -// b5e87bdd5ede61470c29f7a7e4de781af3770f09
> +// 46357a526de89c762d30fb238f35a7d5950a670b
> --- a/include/linux/atomic/atomic-instrumented.h
> +++ b/include/linux/atomic/atomic-instrumented.h
> @@ -1968,6 +1968,36 @@ atomic_long_dec_if_positive(atomic_long_
>  	arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
>  })
>  
> +#define cmpxchg128(ptr, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	kcsan_mb(); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	arch_cmpxchg128(__ai_ptr, __VA_ARGS__); \
> +})
> +
> +#define cmpxchg128_acquire(ptr, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	arch_cmpxchg128_acquire(__ai_ptr, __VA_ARGS__); \
> +})
> +
> +#define cmpxchg128_release(ptr, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	kcsan_release(); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	arch_cmpxchg128_release(__ai_ptr, __VA_ARGS__); \
> +})
> +
> +#define cmpxchg128_relaxed(ptr, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	arch_cmpxchg128_relaxed(__ai_ptr, __VA_ARGS__); \
> +})
> +
>  #define try_cmpxchg(ptr, oldp, ...) \
>  ({ \
>  	typeof(ptr) __ai_ptr = (ptr); \
> @@ -2044,6 +2074,44 @@ atomic_long_dec_if_positive(atomic_long_
>  	arch_try_cmpxchg64_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
>  })
>  
> +#define try_cmpxchg128(ptr, oldp, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	typeof(oldp) __ai_oldp = (oldp); \
> +	kcsan_mb(); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> +	arch_try_cmpxchg128(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
> +#define try_cmpxchg128_acquire(ptr, oldp, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	typeof(oldp) __ai_oldp = (oldp); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> +	arch_try_cmpxchg128_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
> +#define try_cmpxchg128_release(ptr, oldp, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	typeof(oldp) __ai_oldp = (oldp); \
> +	kcsan_release(); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> +	arch_try_cmpxchg128_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
> +#define try_cmpxchg128_relaxed(ptr, oldp, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	typeof(oldp) __ai_oldp = (oldp); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> +	arch_try_cmpxchg128_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
>  #define cmpxchg_local(ptr, ...) \
>  ({ \
>  	typeof(ptr) __ai_ptr = (ptr); \
> @@ -2058,6 +2126,13 @@ atomic_long_dec_if_positive(atomic_long_
>  	arch_cmpxchg64_local(__ai_ptr, __VA_ARGS__); \
>  })
>  
> +#define cmpxchg128_local(ptr, ...) \
> +({ \
> +	typeof(ptr) __ai_ptr = (ptr); \
> +	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +	arch_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
> +})
> +
>  #define sync_cmpxchg(ptr, ...) \
>  ({ \
>  	typeof(ptr) __ai_ptr = (ptr); \
> @@ -2083,4 +2158,4 @@ atomic_long_dec_if_positive(atomic_long_
>  })
>  
>  #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
> -// 764f741eb77a7ad565dc8d99ce2837d5542e8aee
> +// 27320c1ec2bf2878ecb9df3ea4816a7bc0c57a52
> --- a/scripts/atomic/gen-atomic-fallback.sh
> +++ b/scripts/atomic/gen-atomic-fallback.sh
> @@ -217,11 +217,11 @@ cat << EOF
>  
>  EOF
>  
> -for xchg in "arch_xchg" "arch_cmpxchg" "arch_cmpxchg64"; do
> +for xchg in "arch_xchg" "arch_cmpxchg" "arch_cmpxchg64" "arch_cmpxchg128"; do
>  	gen_xchg_fallbacks "${xchg}"
>  done
>  
> -for cmpxchg in "cmpxchg" "cmpxchg64"; do
> +for cmpxchg in "cmpxchg" "cmpxchg64" "cmpxchg128"; do
>  	gen_try_cmpxchg_fallbacks "${cmpxchg}"
>  done
>  
> --- a/scripts/atomic/gen-atomic-instrumented.sh
> +++ b/scripts/atomic/gen-atomic-instrumented.sh
> @@ -166,14 +166,14 @@ grep '^[a-z]' "$1" | while read name met
>  done
>  
>  
> -for xchg in "xchg" "cmpxchg" "cmpxchg64" "try_cmpxchg" "try_cmpxchg64"; do
> +for xchg in "xchg" "cmpxchg" "cmpxchg64" "cmpxchg128" "try_cmpxchg" "try_cmpxchg64" "try_cmpxchg128"; do
>  	for order in "" "_acquire" "_release" "_relaxed"; do
>  		gen_xchg "${xchg}" "${order}" ""
>  		printf "\n"
>  	done
>  done
>  
> -for xchg in "cmpxchg_local" "cmpxchg64_local" "sync_cmpxchg"; do
> +for xchg in "cmpxchg_local" "cmpxchg64_local" "cmpxchg128_local" "sync_cmpxchg"; do
>  	gen_xchg "${xchg}" "" ""
>  	printf "\n"
>  done
> 
> 
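To make the above concrete, here is a minimal sketch of a caller of the
128-bit ops wired up in this patch (names invented; it assumes a naturally
16-byte aligned u128 as introduced earlier in the series):

static u128 fetch_add_u128(u128 *counter, u128 delta)
{
	u128 old = *counter;

	/* try_cmpxchg128() reloads 'old' on failure, so just retry */
	while (!try_cmpxchg128(counter, &old, old + delta))
		;

	return old;
}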

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
  2023-02-02 17:05   ` Heiko Carstens
@ 2023-02-03 17:02   ` Mark Rutland
  2023-02-03 17:25   ` Arnd Bergmann
  2 siblings, 0 replies; 31+ messages in thread
From: Mark Rutland @ 2023-02-03 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, catalin.marinas, dennis, tj,
	cl, hca, gor, agordeev, borntraeger, svens, tglx, mingo, bp,
	dave.hansen, x86, hpa, joro, suravee.suthikulpanit, robin.murphy,
	dwmw2, baolu.lu, Arnd Bergmann, Herbert Xu, davem, penberg,
	rientjes, iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin,
	42.hyeyoo, linux-doc, linux-kernel, linux-mm, linux-s390, iommu,
	linux-arch, linux-crypto

On Thu, Feb 02, 2023 at 03:50:35PM +0100, Peter Zijlstra wrote:
> In order to replace cmpxchg_double() with the newly minted
> cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/arm64/include/asm/percpu.h |   21 +++++++++++++++
>  arch/s390/include/asm/percpu.h  |   17 ++++++++++++
>  arch/x86/include/asm/percpu.h   |   56 ++++++++++++++++++++++++++++++++++++++++
>  include/asm-generic/percpu.h    |    8 +++++
>  include/linux/percpu-defs.h     |   20 ++++++++++++--
>  5 files changed, 120 insertions(+), 2 deletions(-)

For arm64:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> 
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -140,6 +140,10 @@ PERCPU_RET_OP(add, add, ldadd)
>   * re-enabling preemption for preemptible kernels, but doing that in a way
>   * which builds inside a module would mean messing directly with the preempt
>   * count. If you do this, peterz and tglx will hunt you down.
> + *
> + * Not to mention it'll break the actual preemption model for missing a
> + * preemption point when TIF_NEED_RESCHED gets set while preemption is
> + * disabled.
>   */
>  #define this_cpu_cmpxchg_double_8(ptr1, ptr2, o1, o2, n1, n2)		\
>  ({									\
> @@ -240,6 +244,23 @@ PERCPU_RET_OP(add, add, ldadd)
>  #define this_cpu_cmpxchg_8(pcp, o, n)	\
>  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>  
> +#define this_cpu_cmpxchg_16(pcp, o, n)					\
> +({									\
> +	typedef typeof(pcp) pcp_op_T__;					\
> +	union {								\
> +		pcp_op_T__ pot;						\
> +		u128 val;						\
> +	} old__, new__, ret__;						\
> +	pcp_op_T__ *ptr__;						\
> +	old__.pot = o;							\
> +	new__.pot = n;							\
> +	preempt_disable_notrace();					\
> +	ptr__ = raw_cpu_ptr(&(pcp));					\
> +	ret__.val = cmpxchg128_local((void *)ptr__, old__.val, new__.val); \
> +	preempt_enable_notrace();					\
> +	ret__.pot;							\
> +})
> +
>  #ifdef __KVM_NVHE_HYPERVISOR__
>  extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
>  #define __per_cpu_offset
> --- a/arch/s390/include/asm/percpu.h
> +++ b/arch/s390/include/asm/percpu.h
> @@ -148,6 +148,23 @@
>  #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
>  #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
>  
> +#define this_cpu_cmpxchg_16(pcp, oval, nval)				\
> +({									\
> +	typedef typeof(pcp) pcp_op_T__;					\
> +	union {								\
> +		pcp_op_T__ pot;						\
> +		u128 val;						\
> +	} old__, new__, ret__;						\
> +	pcp_op_T__ *ptr__;						\
> +	old__.pot = oval;						\
> +	new__.pot = nval;						\
> +	preempt_disable_notrace();					\
> +	ptr__ = raw_cpu_ptr(&(pcp));					\
> +	ret__.val = cmpxchg128((void *)ptr__, old__.val, new__.val);	\
> +	preempt_enable_notrace();					\
> +	ret__.pot;							\
> +})
> +
>  #define arch_this_cpu_xchg(pcp, nval)					\
>  ({									\
>  	typeof(pcp) *ptr__;						\
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -210,6 +210,62 @@ do {									\
>  	(typeof(_var))(unsigned long) pco_old__;			\
>  })
>  
> +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_CMPXCHG64)
> +#define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval)		\
> +({									\
> +	union {								\
> +		typeof(_var) var;					\
> +		struct {						\
> +			u32 low, high;					\
> +		};							\
> +	} old__, new__;							\
> +									\
> +	old__.var = _oval;						\
> +	new__.var = _nval;						\
> +									\
> +	asm qual ("cmpxchg8b " __percpu_arg([var])			\
> +		  : [var] "+m" (_var),					\
> +		    "+a" (old__.low),					\
> +		    "+d" (old__.high)					\
> +		  : "b" (new__.low),					\
> +		    "c" (new__.high)					\
> +		  : "memory");						\
> +									\
> +	old__.var;							\
> +})
> +
> +#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8,         , pcp, oval, nval)
> +#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
> +#endif
> +
> +#ifdef CONFIG_X86_64
> +#define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval)		\
> +({									\
> +	union {								\
> +		typeof(_var) var;					\
> +		struct {						\
> +			u64 low, high;					\
> +		};							\
> +	} old__, new__;							\
> +									\
> +	old__.var = _oval;						\
> +	new__.var = _nval;						\
> +									\
> +	asm qual ("cmpxchg16b " __percpu_arg([var])			\
> +		  : [var] "+m" (_var),					\
> +		    "+a" (old__.low),					\
> +		    "+d" (old__.high)					\
> +		  : "b" (new__.low),					\
> +		    "c" (new__.high)					\
> +		  : "memory");						\
> +									\
> +	old__.var;							\
> +})
> +
> +#define raw_cpu_cmpxchg_16(pcp, oval, nval)	percpu_cmpxchg128_op(16,         , pcp, oval, nval)
> +#define this_cpu_cmpxchg_16(pcp, oval, nval)	percpu_cmpxchg128_op(16, volatile, pcp, oval, nval)
> +#endif
> +
>  /*
>   * this_cpu_read() makes gcc load the percpu variable every time it is
>   * accessed while this_cpu_read_stable() allows the value to be cached.
> --- a/include/asm-generic/percpu.h
> +++ b/include/asm-generic/percpu.h
> @@ -298,6 +298,10 @@ do {									\
>  #define raw_cpu_cmpxchg_8(pcp, oval, nval) \
>  	raw_cpu_generic_cmpxchg(pcp, oval, nval)
>  #endif
> +#ifndef raw_cpu_cmpxchg_16
> +#define raw_cpu_cmpxchg_16(pcp, oval, nval) \
> +	raw_cpu_generic_cmpxchg(pcp, oval, nval)
> +#endif
>  
>  #ifndef raw_cpu_cmpxchg_double_1
>  #define raw_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> @@ -423,6 +427,10 @@ do {									\
>  #define this_cpu_cmpxchg_8(pcp, oval, nval) \
>  	this_cpu_generic_cmpxchg(pcp, oval, nval)
>  #endif
> +#ifndef this_cpu_cmpxchg_16
> +#define this_cpu_cmpxchg_16(pcp, oval, nval) \
> +	this_cpu_generic_cmpxchg(pcp, oval, nval)
> +#endif
>  
>  #ifndef this_cpu_cmpxchg_double_1
>  #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> --- a/include/linux/percpu-defs.h
> +++ b/include/linux/percpu-defs.h
> @@ -343,6 +343,22 @@ static inline void __this_cpu_preempt_ch
>  	pscr2_ret__;							\
>  })
>  
> +#define __pcpu_size16_call_return2(stem, variable, ...)			\
> +({									\
> +	typeof(variable) pscr2_ret__;					\
> +	__verify_pcpu_ptr(&(variable));					\
> +	switch(sizeof(variable)) {					\
> +	case 1: pscr2_ret__ = stem##1(variable, __VA_ARGS__); break;	\
> +	case 2: pscr2_ret__ = stem##2(variable, __VA_ARGS__); break;	\
> +	case 4: pscr2_ret__ = stem##4(variable, __VA_ARGS__); break;	\
> +	case 8: pscr2_ret__ = stem##8(variable, __VA_ARGS__); break;	\
> +	case 16: pscr2_ret__ = stem##16(variable, __VA_ARGS__); break;	\
> +	default:							\
> +		__bad_size_call_parameter(); break;			\
> +	}								\
> +	pscr2_ret__;							\
> +})
> +
>  /*
>   * Special handling for cmpxchg_double.  cmpxchg_double is passed two
>   * percpu variables.  The first has to be aligned to a double word
> @@ -425,7 +441,7 @@ do {									\
>  #define raw_cpu_add_return(pcp, val)	__pcpu_size_call_return2(raw_cpu_add_return_, pcp, val)
>  #define raw_cpu_xchg(pcp, nval)		__pcpu_size_call_return2(raw_cpu_xchg_, pcp, nval)
>  #define raw_cpu_cmpxchg(pcp, oval, nval) \
> -	__pcpu_size_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
> +	__pcpu_size16_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
>  #define raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
>  	__pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
>  
> @@ -512,7 +528,7 @@ do {									\
>  #define this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
>  #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
>  #define this_cpu_cmpxchg(pcp, oval, nval) \
> -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
>  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
>  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
>  
> 
> 
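Purely as an illustration of the union trick used by the
this_cpu_cmpxchg_16() implementations above (the struct and function here
are invented, and the raw_cpu_ptr()/preemption handling of the real macros
is left out): a two-word value is punned to a single u128 so that one wide
compare-and-swap updates both fields together.

struct two_words {
	void *ptr;
	unsigned long seq;
};

static bool swap_two_words(struct two_words *pcp, struct two_words o,
			   struct two_words n)
{
	/* assumes *pcp is 16 bytes in size and 16-byte aligned */
	union { struct two_words w; u128 val; } old = { .w = o }, new = { .w = n };

	return cmpxchg128_local((u128 *)pcp, old.val, new.val) == old.val;
}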

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
  2023-02-02 17:05   ` Heiko Carstens
  2023-02-03 17:02   ` Mark Rutland
@ 2023-02-03 17:25   ` Arnd Bergmann
  2023-02-06 11:24     ` Peter Zijlstra
  2 siblings, 1 reply; 31+ messages in thread
From: Arnd Bergmann @ 2023-02-03 17:25 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Jonathan Corbet, Will Deacon, Boqun Feng, Mark Rutland,
	Catalin Marinas, dennis, Tejun Heo, Christoph Lameter,
	Heiko Carstens, gor, Alexander Gordeev, borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Thu, Feb 2, 2023, at 15:50, Peter Zijlstra wrote:
> In order to replace cmpxchg_double() with the newly minted
> cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I commented on this in the previous version but never got any
reply from you:

https://lore.kernel.org/all/1d88ba9f-5541-4b67-9cc8-a361eef36547@app.fastmail.com/

Unless I have misunderstood what you are doing, my concerns are
still the same:

>  #define this_cpu_cmpxchg(pcp, oval, nval) \
> -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
>  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, 
> nval2) \
>  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, 
> oval1, oval2, nval1, nval2)

Having a variable-length this_cpu_cmpxchg() that turns into cmpxchg128()
and cmpxchg64() even on CPUs where this traps (!X86_FEATURE_CX16) seems
like a bad design to me.

I would much prefer fixed-length this_cpu_cmpxchg64()/this_cpu_cmpxchg128()
calls that never trap but fall back to the generic version on CPUs that
are lacking the atomics.

     Arnd

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-03 17:25   ` Arnd Bergmann
@ 2023-02-06 11:24     ` Peter Zijlstra
  2023-02-06 12:14       ` Peter Zijlstra
  2023-02-06 12:19       ` Arnd Bergmann
  0 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-06 11:24 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Jonathan Corbet, Will Deacon, Boqun Feng,
	Mark Rutland, Catalin Marinas, dennis, Tejun Heo,
	Christoph Lameter, Heiko Carstens, gor, Alexander Gordeev,
	borntraeger, Sven Schnelle, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Fri, Feb 03, 2023 at 06:25:04PM +0100, Arnd Bergmann wrote:
> On Thu, Feb 2, 2023, at 15:50, Peter Zijlstra wrote:
> > In order to replace cmpxchg_double() with the newly minted
> > cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> I commented on this in the previous version but never got any
> reply from you:
> 
> https://lore.kernel.org/all/1d88ba9f-5541-4b67-9cc8-a361eef36547@app.fastmail.com/

Sorry, seem to have missed that :/

> Unless I have misunderstood what you are doing, my concerns are
> still the same:
> 
> >  #define this_cpu_cmpxchg(pcp, oval, nval) \
> > -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> > +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> >  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, 
> > nval2) \
> >  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, 
> > oval1, oval2, nval1, nval2)
> 
> Having a variable-length this_cpu_cmpxchg() that turns into cmpxchg128()
> and cmpxchg64() even on CPUs where this traps (!X86_FEATURE_CX16) seems
> like a bad design to me.
> 
> I would much prefer fixed-length this_cpu_cmpxchg64()/this_cpu_cmpxchg128()
> calls that never trap but fall back to the generic version on CPUs that
> are lacking the atomics.

You're thinking accidental usage etc.? Lemme see what I can do.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-06 11:24     ` Peter Zijlstra
@ 2023-02-06 12:14       ` Peter Zijlstra
  2023-02-06 12:48         ` Peter Zijlstra
  2023-02-06 12:19       ` Arnd Bergmann
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-06 12:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Jonathan Corbet, Will Deacon, Boqun Feng,
	Mark Rutland, Catalin Marinas, dennis, Tejun Heo,
	Christoph Lameter, Heiko Carstens, gor, Alexander Gordeev,
	borntraeger, Sven Schnelle, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Mon, Feb 06, 2023 at 12:24:00PM +0100, Peter Zijlstra wrote:

> > Unless I have misunderstood what you are doing, my concerns are
> > still the same:
> > 
> > >  #define this_cpu_cmpxchg(pcp, oval, nval) \
> > > -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> > > +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> > >  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, 
> > > nval2) \
> > >  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, 
> > > oval1, oval2, nval1, nval2)
> > 
> > Having a variable-length this_cpu_cmpxchg() that turns into cmpxchg128()
> > and cmpxchg64() even on CPUs where this traps (!X86_FEATURE_CX16) seems
> > like a bad design to me.
> > 
> > I would much prefer fixed-length this_cpu_cmpxchg64()/this_cpu_cmpxchg128()
> > calls that never trap but fall back to the generic version on CPUs that
> > are lacking the atomics.
> 
> You're thinking accidental usage etc.? Lemme see what I can do.

So looking at this I remember why I did it like this: currently 32bit
archs silently fall back to the generics for most/all 64bit ops.

And personally I would just as soon drop support for the
!X86_FEATURE_CX* cpus... :/ Those are some serious museum pieces.

One problem with silent downgrades like this is that the semantics
with respect to NMIs change, which makes for subtle bugs on said
museum pieces.

Basically, using 64bit percpu ops on 32bit is already somewhat dangerous
-- wiring up native cmpxchg64 support in that case seemed an
improvement.

Anyway... let me get on with doing explicit
{raw,this}_cpu_cmpxchg{64,128}() thingies.
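For reference, the generic fallback being talked about is roughly shaped
like the sketch below (a paraphrase, not the literal asm-generic code): it
is protected against interrupts and preemption, but an NMI landing between
the load and the store still observes a non-atomic update, which is the
semantic difference mentioned above.

static u64 generic_pcp_cmpxchg64_sketch(u64 *pcp, u64 old, u64 new)
{
	unsigned long flags;
	u64 ret;

	raw_local_irq_save(flags);
	ret = *pcp;
	if (ret == old)
		*pcp = new;
	raw_local_irq_restore(flags);

	return ret;
}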


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-06 11:24     ` Peter Zijlstra
  2023-02-06 12:14       ` Peter Zijlstra
@ 2023-02-06 12:19       ` Arnd Bergmann
  1 sibling, 0 replies; 31+ messages in thread
From: Arnd Bergmann @ 2023-02-06 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jonathan Corbet, Will Deacon, Boqun Feng,
	Mark Rutland, Catalin Marinas, dennis, Tejun Heo,
	Christoph Lameter, Heiko Carstens, gor, Alexander Gordeev,
	borntraeger, Sven Schnelle, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Mon, Feb 6, 2023, at 12:24, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 06:25:04PM +0100, Arnd Bergmann wrote:

>> Unless I have misunderstood what you are doing, my concerns are
>> still the same:
>> 
>> >  #define this_cpu_cmpxchg(pcp, oval, nval) \
>> > -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
>> > +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
>> >  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, 
>> > nval2) \
>> >  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, 
>> > oval1, oval2, nval1, nval2)
>> 
>> Having a variable-length this_cpu_cmpxchg() that turns into cmpxchg128()
>> and cmpxchg64() even on CPUs where this traps (!X86_FEATURE_CX16) seems
>> like a bad design to me.
>> 
>> I would much prefer fixed-length this_cpu_cmpxchg64()/this_cpu_cmpxchg128()
>> calls that never trap but fall back to the generic version on CPUs that
>> are lacking the atomics.
>
> You're thinking accidental usage etc.? Lemme see what I can do.

I wouldn't even call it accidental when the dependency is so subtle:

Having to call system_has_cmpxchg64() before calling cmpxchg64()
is already somewhat awkward but has some logic to it. Having to
call system_has_cmpxchg64()/system_has_cmpxchg128() before calling
this_cpu_cmpxchg(), depending on the argument size, and only on
architectures that sometimes have cmpxchg128 (but not on those that
always have it, or that never have it), makes it useless as an
abstraction.

     Arnd
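Spelled out, the usage pattern being objected to looks roughly like this
on x86-64 for a 16-byte per-cpu variable (pcp_state and update_under_lock()
are invented names; on arm64/s390 no check is needed at all, and on 32-bit
x86 the relevant check is system_has_cmpxchg64() instead):

static DEFINE_PER_CPU_ALIGNED(u128, pcp_state);

static bool update_under_lock(u128 *p, u128 old, u128 new); /* invented fallback */

static bool update_state(u128 old, u128 new)
{
	if (system_has_cmpxchg128())
		return this_cpu_cmpxchg(pcp_state, old, new) == old;

	/* must not reach this_cpu_cmpxchg(): it would trap without CX16 */
	return update_under_lock(this_cpu_ptr(&pcp_state), old, new);
}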

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-06 12:14       ` Peter Zijlstra
@ 2023-02-06 12:48         ` Peter Zijlstra
  2023-02-06 13:20           ` Arnd Bergmann
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2023-02-06 12:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Jonathan Corbet, Will Deacon, Boqun Feng,
	Mark Rutland, Catalin Marinas, dennis, Tejun Heo,
	Christoph Lameter, Heiko Carstens, gor, Alexander Gordeev,
	borntraeger, Sven Schnelle, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Mon, Feb 06, 2023 at 01:14:28PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 06, 2023 at 12:24:00PM +0100, Peter Zijlstra wrote:
> 
> > > Unless I have misunderstood what you are doing, my concerns are
> > > still the same:
> > > 
> > > >  #define this_cpu_cmpxchg(pcp, oval, nval) \
> > > > -	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> > > > +	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
> > > >  #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, 
> > > > nval2) \
> > > >  	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, 
> > > > oval1, oval2, nval1, nval2)
> > > 
> > > Having a variable-length this_cpu_cmpxchg() that turns into cmpxchg128()
> > > and cmpxchg64() even on CPUs where this traps (!X86_FEATURE_CX16) seems
> > > like a bad design to me.
> > > 
> > > I would much prefer fixed-length this_cpu_cmpxchg64()/this_cpu_cmpxchg128()
> > > calls that never trap but fall back to the generic version on CPUs that
> > > are lacking the atomics.
> > 
> > You're thinking accidental usage etc.? Lemme see what I can do.
> 
> So looking at this I remember why I did it like this: currently 32bit
> archs silently fall back to the generics for most/all 64bit ops.
> 
> And personally I would just as soon drop support for the
> !X86_FEATURE_CX* cpus... :/ Those are some serious museum pieces.
> 
> One problem with silent downgrades like this is that the semantics
> with respect to NMIs change, which makes for subtle bugs on said
> museum pieces.
> 
> Basically, using 64bit percpu ops on 32bit is already somewhat dangerous
> -- wiring up native cmpxchg64 support in that case seemed an
> improvement.
> 
> Anyway... let me get on with doing explicit
> {raw,this}_cpu_cmpxchg{64,128}() thingies.

I only converted x86 and didn't do the automagic downgrade...

Opinions?

---
 arch/x86/include/asm/percpu.h | 11 +++++++----
 include/asm-generic/percpu.h  | 18 ++++++++++++++----
 include/linux/percpu-defs.h   | 20 ++------------------
 mm/slab.h                     |  2 ++
 mm/slub.c                     | 21 +++++++++++----------
 5 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 4c803a1fd0e7..7515e065369b 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -214,7 +214,7 @@ do {									\
 #define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval)		\
 ({									\
 	union {								\
-		typeof(_var) var;					\
+		u64 var;						\
 		struct {						\
 			u32 low, high;					\
 		};							\
@@ -234,15 +234,18 @@ do {									\
 	old__.var;							\
 })
 
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8,         , pcp, oval, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
+#define raw_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg64_op(8,         , pcp, oval, nval)
+#define this_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
 #endif
 
 #ifdef CONFIG_X86_64
+#define raw_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg_op(8,         , pcp, oval, nval)
+#define this_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+
 #define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval)		\
 ({									\
 	union {								\
-		typeof(_var) var;					\
+		u128 var;						\
 		struct {						\
 			u64 low, high;					\
 		};							\
diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index ad254a20fe68..7da7d1793411 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -274,8 +274,13 @@ do {									\
 #define raw_cpu_cmpxchg_8(pcp, oval, nval) \
 	raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
-#ifndef raw_cpu_cmpxchg_16
-#define raw_cpu_cmpxchg_16(pcp, oval, nval) \
+
+#ifndef raw_cpu_cmpxchg64
+#define raw_cpu_cmpxchg64(pcp, oval, nval) \
+	raw_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef raw_cpu_cmpxchg128
+#define raw_cpu_cmpxchg128(pcp, oval, nval) \
 	raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
@@ -386,8 +391,13 @@ do {									\
 #define this_cpu_cmpxchg_8(pcp, oval, nval) \
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
-#ifndef this_cpu_cmpxchg_16
-#define this_cpu_cmpxchg_16(pcp, oval, nval) \
+
+#ifndef this_cpu_cmpxchg64
+#define this_cpu_cmpxchg64(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg128
+#define this_cpu_cmpxchg128(pcp, oval, nval) \
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index fe3c7fc2d411..7cd614a46af4 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -343,22 +343,6 @@ static inline void __this_cpu_preempt_check(const char *op) { }
 	pscr2_ret__;							\
 })
 
-#define __pcpu_size16_call_return2(stem, variable, ...)			\
-({									\
-	typeof(variable) pscr2_ret__;					\
-	__verify_pcpu_ptr(&(variable));					\
-	switch(sizeof(variable)) {					\
-	case 1: pscr2_ret__ = stem##1(variable, __VA_ARGS__); break;	\
-	case 2: pscr2_ret__ = stem##2(variable, __VA_ARGS__); break;	\
-	case 4: pscr2_ret__ = stem##4(variable, __VA_ARGS__); break;	\
-	case 8: pscr2_ret__ = stem##8(variable, __VA_ARGS__); break;	\
-	case 16: pscr2_ret__ = stem##16(variable, __VA_ARGS__); break;	\
-	default:							\
-		__bad_size_call_parameter(); break;			\
-	}								\
-	pscr2_ret__;							\
-})
-
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -414,7 +398,7 @@ do {									\
 #define raw_cpu_add_return(pcp, val)	__pcpu_size_call_return2(raw_cpu_add_return_, pcp, val)
 #define raw_cpu_xchg(pcp, nval)		__pcpu_size_call_return2(raw_cpu_xchg_, pcp, nval)
 #define raw_cpu_cmpxchg(pcp, oval, nval) \
-	__pcpu_size16_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
+	__pcpu_size_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
 #define raw_cpu_sub(pcp, val)		raw_cpu_add(pcp, -(val))
 #define raw_cpu_inc(pcp)		raw_cpu_add(pcp, 1)
 #define raw_cpu_dec(pcp)		raw_cpu_sub(pcp, 1)
@@ -493,7 +477,7 @@ do {									\
 #define this_cpu_add_return(pcp, val)	__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
-	__pcpu_size16_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
 #define this_cpu_sub(pcp, val)		this_cpu_add(pcp, -(typeof(pcp))(val))
 #define this_cpu_inc(pcp)		this_cpu_add(pcp, 1)
 #define this_cpu_dec(pcp)		this_cpu_sub(pcp, 1)
diff --git a/mm/slab.h b/mm/slab.h
index 19e1899673ef..50b5edd6a950 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -25,11 +25,13 @@ typedef union {
 # ifdef system_has_cmpxchg128
 # define system_has_freelist_aba()	system_has_cmpxchg128()
 # define try_cmpxchg_freelist		try_cmpxchg128
+# define this_cpu_cmpxchg_freelist	this_cpu_cmpxchg128
 # endif
 #else /* CONFIG_64BIT */
 # ifdef system_has_cmpxchg64
 # define system_has_freelist_aba()	system_has_cmpxchg64()
 # define try_cmpxchg_freelist		try_cmpxchg64
+# define this_cpu_cmpxchg_freelist	this_cpu_cmpxchg64
 # endif
 #endif /* CONFIG_64BIT */
 
diff --git a/mm/slub.c b/mm/slub.c
index 45f2b28d60e1..35939c5aa28a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -523,17 +523,14 @@ __update_freelist_fast(struct slab *slab,
 		      void *freelist_old, unsigned long counters_old,
 		      void *freelist_new, unsigned long counters_new)
 {
-
-	bool ret = false;
-
-#ifdef system_has_freelist_aba
+#ifdef syste_has_freelist_aba
 	freelist_aba_t old = { .freelist = freelist_old, .counter = counters_old };
 	freelist_aba_t new = { .freelist = freelist_new, .counter = counters_new };
 
-	ret = try_cmpxchg_freelist(&slab->freelist_counter.full, &old.full, new.full);
-#endif /* system_has_freelist_aba */
-
-	return ret;
+	return try_cmpxchg_freelist(&slab->freelist_counter.full, &old.full, new.full);
+#else
+	return false;
+#endif
 }
 
 static inline bool
@@ -3039,11 +3036,15 @@ __update_cpu_freelist_fast(struct kmem_cache *s,
 			   void *freelist_old, void *freelist_new,
 			   unsigned long tid)
 {
+#ifdef system_has_freelist_aba
 	freelist_aba_t old = { .freelist = freelist_old, .counter = tid };
 	freelist_aba_t new = { .freelist = freelist_new, .counter = next_tid(tid) };
 
-	return this_cpu_cmpxchg(s->cpu_slab->freelist_tid.full,
-				old.full, new.full) == old.full;
+	return this_cpu_cmpxchg_freelist(s->cpu_slab->freelist_tid.full,
+					 old.full, new.full) == old.full;
+#else
+	return false;
+#endif
 }
 
 /*

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 05/10] percpu: Wire up cmpxchg128
  2023-02-06 12:48         ` Peter Zijlstra
@ 2023-02-06 13:20           ` Arnd Bergmann
  0 siblings, 0 replies; 31+ messages in thread
From: Arnd Bergmann @ 2023-02-06 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jonathan Corbet, Will Deacon, Boqun Feng,
	Mark Rutland, Catalin Marinas, dennis, Tejun Heo,
	Christoph Lameter, Heiko Carstens, gor, Alexander Gordeev,
	borntraeger, Sven Schnelle, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Joerg Roedel,
	suravee.suthikulpanit, Robin Murphy, dwmw2, Baolu Lu, Herbert Xu,
	David S . Miller, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	linux-doc, linux-kernel, linux-mm, linux-s390, iommu, Linux-Arch,
	linux-crypto

On Mon, Feb 6, 2023, at 13:48, Peter Zijlstra wrote:
> On Mon, Feb 06, 2023 at 01:14:28PM +0100, Peter Zijlstra wrote:
>> On Mon, Feb 06, 2023 at 12:24:00PM +0100, Peter Zijlstra wrote:
>> Basically, using 64bit percpu ops on 32bit is already somewhat dangerous
>> -- wiring up native cmpxchg64 support in that case seemed an
>> improvement.
>> 
>> Anyway... let me get on with doing explicit
>> {raw,this}_cpu_cmpxchg{64,128}() thingies.
>
> I only converted x86 and didn't do the automagic downgrade...
>
> Opinions?

I think that's much better: it keeps the interface symmetric
between cmpxchg() and this_cpu_cmpxchg(), and makes it harder
to run into the subtle corner case on old x86 CPUs.

For the M486/M586/MK8/MPSC/MATOM/MCORE2 configs, I would
probably want to go with a compile-time check and have most
distro kernels build with at least M586TSC for 32-bit or
an upgraded CONFIG_GENERIC_CPU for x86_64 that assumes cmpxchg16b
(nehalem, second-generation k8, silverthorne, ...) in
order to turn system_has_cmpxchg* into a constant value on
all architectures. That can be a separate discussion though
and shouldn't block your series as you just keep the current
state.

> -#ifdef system_has_freelist_aba
> +#ifdef syste_has_freelist_aba

Typo

    Arnd
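A sketch of what such a compile-time check could look like on x86-64; the
Kconfig symbol below is hypothetical and only shows the shape, and the
runtime test is assumed to be the X86_FEATURE_CX16 check the series
already relies on:

/* hypothetical: set when the minimum supported CPU always has CMPXCHG16B */
#ifdef CONFIG_X86_HAS_CMPXCHG16B
# define system_has_cmpxchg128()	true
#else
# define system_has_cmpxchg128()	boot_cpu_has(X86_FEATURE_CX16)
#endif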

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 08/10] slub: Replace cmpxchg_double()
  2023-02-02 14:50 ` [PATCH v2 08/10] slub: " Peter Zijlstra
@ 2023-02-08 13:31   ` Hyeonggon Yoo
  0 siblings, 0 replies; 31+ messages in thread
From: Hyeonggon Yoo @ 2023-02-08 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, hca, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, Herbert Xu, davem, penberg, rientjes,
	iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin, linux-doc,
	linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:38PM +0100, Peter Zijlstra wrote:
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slub_def.h |   12 ++-
>  mm/slab.h                |   45 +++++++++++++-
>  mm/slub.c                |  142 ++++++++++++++++++++++++++++-------------------
>  3 files changed, 135 insertions(+), 64 deletions(-)
> 
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -39,7 +39,8 @@ enum stat_item {
>  	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
>  	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
>  	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
> -	NR_SLUB_STAT_ITEMS };
> +	NR_SLUB_STAT_ITEMS
> +};
>  
>  #ifndef CONFIG_SLUB_TINY
>  /*
> @@ -47,8 +48,13 @@ enum stat_item {
>   * with this_cpu_cmpxchg_double() alignment requirements.
>   */
>  struct kmem_cache_cpu {
> -	void **freelist;	/* Pointer to next available object */
> -	unsigned long tid;	/* Globally unique transaction id */
> +	union {
> +		struct {
> +			void **freelist;	/* Pointer to next available object */
> +			unsigned long tid;	/* Globally unique transaction id */
> +		};
> +		freelist_aba_t freelist_tid;
> +	};
>  	struct slab *slab;	/* The slab from which we are allocating */
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  	struct slab *partial;	/* Partially allocated frozen slabs */
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -5,6 +5,34 @@
>   * Internal slab definitions
>   */
>  
> +/*
> + * Freelist pointer and counter to cmpxchg together, avoids the typical ABA
> + * problems with cmpxchg of just a pointer.
> + */
> +typedef union {
> +	struct {
> +		void *freelist;
> +		unsigned long counter;
> +	};
> +#ifdef CONFIG_64BIT
> +	u128 full;
> +#else
> +	u64 full;
> +#endif
> +} freelist_aba_t;
> +
> +#ifdef CONFIG_64BIT
> +# ifdef system_has_cmpxchg128
> +# define system_has_freelist_aba()	system_has_cmpxchg128()
> +# define try_cmpxchg_freelist		try_cmpxchg128
> +# endif
> +#else /* CONFIG_64BIT */
> +# ifdef system_has_cmpxchg64
> +# define system_has_freelist_aba()	system_has_cmpxchg64()
> +# define try_cmpxchg_freelist		try_cmpxchg64
> +# endif
> +#endif /* CONFIG_64BIT */
> +
>  /* Reuses the bits in struct page */
>  struct slab {
>  	unsigned long __page_flags;
> @@ -37,14 +65,21 @@ struct slab {
>  #endif
>  			};
>  			/* Double-word boundary */
> -			void *freelist;		/* first free object */
>  			union {
> -				unsigned long counters;
>  				struct {
> -					unsigned inuse:16;
> -					unsigned objects:15;
> -					unsigned frozen:1;
> +					void *freelist;		/* first free object */
> +					union {
> +						unsigned long counters;
> +						struct {
> +							unsigned inuse:16;
> +							unsigned objects:15;
> +							unsigned frozen:1;
> +						};
> +					};
>  				};
> +#ifdef system_has_freelist_aba
> +				freelist_aba_t freelist_counter;
> +#endif
>  			};
>  		};
>  		struct rcu_head rcu_head;
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -292,7 +292,13 @@ static inline bool kmem_cache_has_cpu_pa
>  /* Poison object */
>  #define __OBJECT_POISON		((slab_flags_t __force)0x80000000U)
>  /* Use cmpxchg_double */
> +
> +#if defined(system_has_freelist_aba) && \
> +    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
>  #define __CMPXCHG_DOUBLE	((slab_flags_t __force)0x40000000U)
> +#else
> +#define __CMPXCHG_DOUBLE	((slab_flags_t __force)0U)
> +#endif
>  
>  /*
>   * Tracking user of a slab.
> @@ -512,6 +518,43 @@ static __always_inline void slab_unlock(
>  	__bit_spin_unlock(PG_locked, &page->flags);
>  }
>  
> +static inline bool
> +__update_freelist_fast(struct slab *slab,
> +		      void *freelist_old, unsigned long counters_old,
> +		      void *freelist_new, unsigned long counters_new)
> +{
> +
> +	bool ret = false;
> +
> +#ifdef system_has_freelist_aba
> +	freelist_aba_t old = { .freelist = freelist_old, .counter = counters_old };
> +	freelist_aba_t new = { .freelist = freelist_new, .counter = counters_new };
> +
> +	ret = try_cmpxchg_freelist(&slab->freelist_counter.full, &old.full, new.full);
> +#endif /* system_has_freelist_aba */
> +
> +	return ret;
> +}
> +
> +static inline bool
> +__update_freelist_slow(struct slab *slab,
> +		      void *freelist_old, unsigned long counters_old,
> +		      void *freelist_new, unsigned long counters_new)
> +{
> +	bool ret = false;
> +
> +	slab_lock(slab);
> +	if (slab->freelist == freelist_old &&
> +	    slab->counters == counters_old) {
> +		slab->freelist = freelist_new;
> +		slab->counters = counters_new;
> +		ret = true;
> +	}
> +	slab_unlock(slab);
> +
> +	return ret;
> +}
> +
>  /*
>   * Interrupts must be disabled (for the fallback code to work right), typically
>   * by an _irqsave() lock variant. On PREEMPT_RT the preempt_disable(), which is
> @@ -519,33 +562,25 @@ static __always_inline void slab_unlock(
>   * allocation/ free operation in hardirq context. Therefore nothing can
>   * interrupt the operation.
>   */
> -static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
> +static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *slab,
>  		void *freelist_old, unsigned long counters_old,
>  		void *freelist_new, unsigned long counters_new,
>  		const char *n)
>  {
> +	bool ret;
> +
>  	if (USE_LOCKLESS_FAST_PATH())
>  		lockdep_assert_irqs_disabled();
> -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
> -    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> +
>  	if (s->flags & __CMPXCHG_DOUBLE) {
> -		if (cmpxchg_double(&slab->freelist, &slab->counters,
> -				   freelist_old, counters_old,
> -				   freelist_new, counters_new))
> -			return true;
> -	} else
> -#endif
> -	{
> -		slab_lock(slab);
> -		if (slab->freelist == freelist_old &&
> -					slab->counters == counters_old) {
> -			slab->freelist = freelist_new;
> -			slab->counters = counters_new;
> -			slab_unlock(slab);
> -			return true;
> -		}
> -		slab_unlock(slab);
> +		ret = __update_freelist_fast(slab, freelist_old, counters_old,
> +				            freelist_new, counters_new);
> +	} else {
> +		ret = __update_freelist_slow(slab, freelist_old, counters_old,
> +				            freelist_new, counters_new);
>  	}
> +	if (likely(ret))
> +		return true;
>  
>  	cpu_relax();
>  	stat(s, CMPXCHG_DOUBLE_FAIL);
> @@ -557,36 +592,26 @@ static inline bool __cmpxchg_double_slab
>  	return false;
>  }
>  
> -static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
> +static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
>  		void *freelist_old, unsigned long counters_old,
>  		void *freelist_new, unsigned long counters_new,
>  		const char *n)
>  {
> -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
> -    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> +	bool ret;
> +
>  	if (s->flags & __CMPXCHG_DOUBLE) {
> -		if (cmpxchg_double(&slab->freelist, &slab->counters,
> -				   freelist_old, counters_old,
> -				   freelist_new, counters_new))
> -			return true;
> -	} else
> -#endif
> -	{
> +		ret = __update_freelist_fast(slab, freelist_old, counters_old,
> +				            freelist_new, counters_new);
> +	} else {
>  		unsigned long flags;
>  
>  		local_irq_save(flags);
> -		slab_lock(slab);
> -		if (slab->freelist == freelist_old &&
> -					slab->counters == counters_old) {
> -			slab->freelist = freelist_new;
> -			slab->counters = counters_new;
> -			slab_unlock(slab);
> -			local_irq_restore(flags);
> -			return true;
> -		}
> -		slab_unlock(slab);
> +		ret = __update_freelist_slow(slab, freelist_old, counters_old,
> +				            freelist_new, counters_new);
>  		local_irq_restore(flags);
>  	}
> +	if (likely(ret))
> +		return true;
>  
>  	cpu_relax();
>  	stat(s, CMPXCHG_DOUBLE_FAIL);
> @@ -2229,7 +2254,7 @@ static inline void *acquire_slab(struct
>  	VM_BUG_ON(new.frozen);
>  	new.frozen = 1;
>  
> -	if (!__cmpxchg_double_slab(s, slab,
> +	if (!__slab_update_freelist(s, slab,
>  			freelist, counters,
>  			new.freelist, new.counters,
>  			"acquire_slab"))
> @@ -2555,7 +2580,7 @@ static void deactivate_slab(struct kmem_
>  	}
>  
>  
> -	if (!cmpxchg_double_slab(s, slab,
> +	if (!slab_update_freelist(s, slab,
>  				old.freelist, old.counters,
>  				new.freelist, new.counters,
>  				"unfreezing slab")) {
> @@ -2612,7 +2637,7 @@ static void __unfreeze_partials(struct k
>  
>  			new.frozen = 0;
>  
> -		} while (!__cmpxchg_double_slab(s, slab,
> +		} while (!__slab_update_freelist(s, slab,
>  				old.freelist, old.counters,
>  				new.freelist, new.counters,
>  				"unfreezing slab"));
> @@ -3009,6 +3034,18 @@ static inline bool pfmemalloc_match(stru
>  }
>  
>  #ifndef CONFIG_SLUB_TINY
> +static inline bool
> +__update_cpu_freelist_fast(struct kmem_cache *s,
> +			   void *freelist_old, void *freelist_new,
> +			   unsigned long tid)
> +{
> +	freelist_aba_t old = { .freelist = freelist_old, .counter = tid };
> +	freelist_aba_t new = { .freelist = freelist_new, .counter = next_tid(tid) };
> +
> +	return this_cpu_cmpxchg(s->cpu_slab->freelist_tid.full,
> +				old.full, new.full) == old.full;
> +}
> +
>  /*
>   * Check the slab->freelist and either transfer the freelist to the
>   * per cpu freelist or deactivate the slab.
> @@ -3035,7 +3072,7 @@ static inline void *get_freelist(struct
>  		new.inuse = slab->objects;
>  		new.frozen = freelist != NULL;
>  
> -	} while (!__cmpxchg_double_slab(s, slab,
> +	} while (!__slab_update_freelist(s, slab,
>  		freelist, counters,
>  		NULL, new.counters,
>  		"get_freelist"));
> @@ -3360,11 +3397,7 @@ static __always_inline void *__slab_allo
>  		 * against code executing on this cpu *not* from access by
>  		 * other cpus.
>  		 */
> -		if (unlikely(!this_cpu_cmpxchg_double(
> -				s->cpu_slab->freelist, s->cpu_slab->tid,
> -				object, tid,
> -				next_object, next_tid(tid)))) {
> -
> +		if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
>  			note_cmpxchg_failure("slab_alloc", s, tid);
>  			goto redo;
>  		}
> @@ -3632,7 +3665,7 @@ static void __slab_free(struct kmem_cach
>  			}
>  		}
>  
> -	} while (!cmpxchg_double_slab(s, slab,
> +	} while (!slab_update_freelist(s, slab,
>  		prior, counters,
>  		head, new.counters,
>  		"__slab_free"));
> @@ -3737,11 +3770,7 @@ static __always_inline void do_slab_free
>  
>  		set_freepointer(s, tail_obj, freelist);
>  
> -		if (unlikely(!this_cpu_cmpxchg_double(
> -				s->cpu_slab->freelist, s->cpu_slab->tid,
> -				freelist, tid,
> -				head, next_tid(tid)))) {
> -
> +		if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
>  			note_cmpxchg_failure("slab_free", s, tid);
>  			goto redo;
>  		}
> @@ -4505,11 +4534,12 @@ static int kmem_cache_open(struct kmem_c
>  		}
>  	}
>  
> -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
> +#if defined(system_has_freelist_aba) && \
>      defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> -	if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0)
> +	if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
>  		/* Enable fast mode */
>  		s->flags |= __CMPXCHG_DOUBLE;
> +	}
>  #endif
>  
>  	/*

Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!
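For readers wondering about the name: freelist_aba_t exists because
comparing only the freelist pointer is not enough; the list can be popped
and re-pushed between the read and the cmpxchg, so the same pointer comes
back with different contents (the classic ABA problem). Pairing the
pointer with a counter that is bumped on every update, and cmpxchg'ing
both as one wide word, rules that out. A minimal illustrative sketch for
the 64-bit case (names invented):

typedef union {
	struct {
		void *head;		/* freelist head */
		unsigned long seq;	/* bumped on every successful update */
	};
	u128 full;
} aba_list_t;

static bool pop_head(aba_list_t *list, void *seen_head, void *next,
		     unsigned long seen_seq)
{
	aba_list_t old = { .head = seen_head, .seq = seen_seq };
	aba_list_t new = { .head = next,      .seq = seen_seq + 1 };

	/* succeeds only if neither the pointer nor the counter changed */
	return try_cmpxchg128(&list->full, &old.full, new.full);
}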

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/10] arch: Remove cmpxchg_double
  2023-02-02 14:50 ` [PATCH v2 09/10] arch: Remove cmpxchg_double Peter Zijlstra
@ 2023-02-08 13:44   ` Hyeonggon Yoo
  0 siblings, 0 replies; 31+ messages in thread
From: Hyeonggon Yoo @ 2023-02-08 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, corbet, will, boqun.feng, mark.rutland,
	catalin.marinas, dennis, tj, cl, hca, gor, agordeev, borntraeger,
	svens, tglx, mingo, bp, dave.hansen, x86, hpa, joro,
	suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu,
	Arnd Bergmann, Herbert Xu, davem, penberg, rientjes,
	iamjoonsoo.kim, Andrew Morton, vbabka, roman.gushchin, linux-doc,
	linux-kernel, linux-mm, linux-s390, iommu, linux-arch,
	linux-crypto

On Thu, Feb 02, 2023 at 03:50:39PM +0100, Peter Zijlstra wrote:
> No moar users, remove the monster.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  Documentation/core-api/this_cpu_ops.rst    |    2 -
>  arch/arm64/include/asm/atomic_ll_sc.h      |   33 ----------------
>  arch/arm64/include/asm/atomic_lse.h        |   36 ------------------
>  arch/arm64/include/asm/cmpxchg.h           |   46 -----------------------
>  arch/arm64/include/asm/percpu.h            |   10 -----
>  arch/s390/include/asm/cmpxchg.h            |   34 -----------------
>  arch/s390/include/asm/percpu.h             |   18 ---------
>  arch/x86/include/asm/cmpxchg.h             |   25 ------------
>  arch/x86/include/asm/cmpxchg_32.h          |    1 
>  arch/x86/include/asm/cmpxchg_64.h          |    1 
>  arch/x86/include/asm/percpu.h              |   41 --------------------
>  include/asm-generic/percpu.h               |   58 -----------------------------
>  include/linux/atomic/atomic-instrumented.h |   17 --------
>  include/linux/percpu-defs.h                |   38 -------------------
>  scripts/atomic/gen-atomic-instrumented.sh  |   17 ++------
>  15 files changed, 6 insertions(+), 371 deletions(-)
> 
> --- a/arch/x86/include/asm/cmpxchg.h
> +++ b/arch/x86/include/asm/cmpxchg.h
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -385,30 +368,6 @@ do {									\
>  #define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
>  #define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
>  #define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
> -
> -/*
> - * Pretty complex macro to generate cmpxchg16 instruction.  The instruction
> - * is not supported on early AMD64 processors so we must be able to emulate
> - * it in software.  The address used in the cmpxchg16 instruction must be
> - * aligned to a 16 byte boundary.
> - */
> -#define percpu_cmpxchg16b_double(pcp1, pcp2, o1, o2, n1, n2)		\
> -({									\
> -	bool __ret;							\
> -	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
> -	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
> -	alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \

I guess now arch/x86/lib/cmpxchg*b_emu.S could be dropped too?

> -		       "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t",	\
> -		       X86_FEATURE_CX16,				\
> -		       ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),		\
> -				   "+m" (pcp2), "+d" (__o2)),		\
> -		       "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");	\
> -	__ret;								\
> -})
> -
> -#define raw_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
> -#define this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
> -
>  #endif

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-02-08 13:46 UTC | newest]

Thread overview: 31+ messages
2023-02-02 14:50 [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Peter Zijlstra
2023-02-02 14:50 ` [PATCH v2 01/10] cyrpto/b128ops: Remove struct u128 Peter Zijlstra
2023-02-02 20:21   ` H. Peter Anvin
2023-02-03  1:06   ` Herbert Xu
2023-02-02 14:50 ` [PATCH v2 02/10] types: Introduce [us]128 Peter Zijlstra
2023-02-03  3:37   ` Herbert Xu
2023-02-02 14:50 ` [PATCH v2 03/10] arch: Introduce arch_{,try_}_cmpxchg128{,_local}() Peter Zijlstra
2023-02-02 17:04   ` Heiko Carstens
2023-02-03 16:52   ` Mark Rutland
2023-02-02 14:50 ` [PATCH v2 04/10] instrumentation: Wire up cmpxchg128() Peter Zijlstra
2023-02-03 16:55   ` Mark Rutland
2023-02-02 14:50 ` [PATCH v2 05/10] percpu: Wire up cmpxchg128 Peter Zijlstra
2023-02-02 17:05   ` Heiko Carstens
2023-02-03 17:02   ` Mark Rutland
2023-02-03 17:25   ` Arnd Bergmann
2023-02-06 11:24     ` Peter Zijlstra
2023-02-06 12:14       ` Peter Zijlstra
2023-02-06 12:48         ` Peter Zijlstra
2023-02-06 13:20           ` Arnd Bergmann
2023-02-06 12:19       ` Arnd Bergmann
2023-02-02 14:50 ` [PATCH v2 06/10] x86,amd_iommu: Replace cmpxchg_double() Peter Zijlstra
2023-02-02 14:50 ` [PATCH v2 07/10] x86,intel_iommu: " Peter Zijlstra
2023-02-03  2:51   ` Baolu Lu
2023-02-02 14:50 ` [PATCH v2 08/10] slub: " Peter Zijlstra
2023-02-08 13:31   ` Hyeonggon Yoo
2023-02-02 14:50 ` [PATCH v2 09/10] arch: Remove cmpxchg_double Peter Zijlstra
2023-02-08 13:44   ` Hyeonggon Yoo
2023-02-02 14:50 ` [PATCH v2 10/10] s390/cpum_sf: Convert to cmpxchg128() Peter Zijlstra
2023-02-02 17:05   ` Heiko Carstens
2023-02-02 19:39 ` [PATCH v2 00/10] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() Linus Torvalds
2023-02-02 22:45   ` David Laight
