* [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
@ 2023-10-01 13:14 Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip Uros Bizjak
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 13:14 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

This patchset resurrects the work of Richard Henderson [1] and Nadav Amit [2]
to introduce the named address spaces compiler extension [3,4] into the Linux kernel.

On the x86 target, variables may be declared as being relative to the %fs or %gs
segment using one of the following named address space qualifiers:

__seg_fs
__seg_gs

The qualified object is then accessed with the respective segment override prefix.
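
For example (an illustrative snippet, not part of the patchset itself), a read
of a __seg_gs qualified variable compiles to a %gs prefixed move:

__seg_gs int m;
int foo (void) { return m; }

        movl    %gs:m(%rip), %eax       # load of m via the %gs segment
        ret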

The following patchset takes a more cautious approach and converts only moves,
currently implemented as asm, into generic moves to/from the named address space.
The compiler is then able to propagate memory arguments into the instructions that
use these memory references, producing more compact assembly, in addition to
avoiding the use of a register as a temporary to hold a value loaded from memory.
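
In a simplified sketch (illustrative only; the real accessors are introduced in
patch 4 and use the __pcpu_* helpers), the conversion turns an asm-based read
into an ordinary C load through a __seg_gs qualified pointer, which the compiler
is then free to fold into the consuming instruction:

#include <stdint.h>     /* for uintptr_t; the kernel provides its own */

/* before: the asm forces the loaded value through a register */
static inline int old_cpu_read_4(int *p)
{
        int val;

        asm("movl %%gs:%1, %0" : "=r" (val) : "m" (*p));
        return val;
}

/* after: a plain C load via a __seg_gs qualified pointer; the memory
 * operand is now visible to the compiler and can be propagated.  The
 * uintptr_t cast mirrors the patchset's __my_cpu_ptr() helper. */
static inline int new_cpu_read_4(int *p)
{
        return *(__seg_gs int *)(uintptr_t)p;
}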

The patchset enables propagation of hundreds of memory arguments, resulting in
a cumulative code size reduction of 7.94 kB (please note that the kernel is
compiled with -O2, so code size is not an entirely accurate measure; some
parts of the code can now be duplicated for better performance due to -O2,
etc.).

Some examples of propagations:

a) into sign/zero extensions:

 110b54:       65 0f b6 05 00 00 00    movzbl %gs:0x0(%rip),%eax
 11ab90:       65 0f b6 15 00 00 00    movzbl %gs:0x0(%rip),%edx
 14484a:       65 0f b7 35 00 00 00    movzwl %gs:0x0(%rip),%esi
 1a08a9:       65 0f b6 43 78          movzbl %gs:0x78(%rbx),%eax
 1a08f9:       65 0f b6 43 78          movzbl %gs:0x78(%rbx),%eax

 4ab29a:       65 48 63 15 00 00 00    movslq %gs:0x0(%rip),%rdx
 4be128:       65 4c 63 25 00 00 00    movslq %gs:0x0(%rip),%r12
 547468:       65 48 63 1f             movslq %gs:(%rdi),%rbx
 5474e7:       65 48 63 0a             movslq %gs:(%rdx),%rcx
 54d05d:       65 48 63 0d 00 00 00    movslq %gs:0x0(%rip),%rcx

b) into compares:

 b40804:       65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
 b487e8:       65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
 b6f14c:       65 f6 05 00 00 00 00    testb  $0x1,%gs:0x0(%rip)
 bac1b8:       65 f6 05 00 00 00 00    testb  $0x1,%gs:0x0(%rip)
 df2244:       65 f7 05 00 00 00 00    testl  $0xff00,%gs:0x0(%rip)

 9a7517:       65 80 3d 00 00 00 00    cmpb   $0x0,%gs:0x0(%rip)
 b282ba:       65 44 3b 35 00 00 00    cmp    %gs:0x0(%rip),%r14d
 b48f61:       65 66 83 3d 00 00 00    cmpw   $0x8,%gs:0x0(%rip)
 b493fe:       65 80 38 00             cmpb   $0x0,%gs:(%rax)
 b73867:       65 66 83 3d 00 00 00    cmpw   $0x8,%gs:0x0(%rip)

c) into other insns:

 65ec02:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 6c98ac:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 9aafaf:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 b45868:       65 0f 48 35 00 00 00    cmovs  %gs:0x0(%rip),%esi
 d276f8:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx

The above propagations result in the following code size
improvements for the current mainline kernel (with the default config),
compiled with

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

   text    data     bss     dec     hex filename
25508862        4386540  808388 30703790        1d480ae vmlinux-vanilla.o
25500922        4386532  808388 30695842        1d461a2 vmlinux-new.o

The conversion of other read-modify-write instructions does not bring any
benefit; the compiler has some problems constructing RMW instructions from
generic code and easily misses some opportunities.

There are other possible optimizations, involving arch_raw_cpu_ptr and
aggressive caching of current, that are present in the original patch
series. These can be added as follow-ups at some later time.

The patchset was tested on Fedora 38 with kernel 6.5.5 and gcc 13.2.1.
(In fact, I'm writing this message on the patched kernel.)

[1] https://lore.kernel.org/lkml/1454483253-11246-1-git-send-email-rth@twiddle.net/
[2] https://lore.kernel.org/lkml/20190823224424.15296-1-namit@vmware.com/
[3] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html
[4] https://clang.llvm.org/docs/LanguageExtensions.html#target-specific-extensions

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Uros Bizjak (4):
  x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip
  x86/percpu: Detect compiler support for named address spaces
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu read/write accessors

 arch/x86/Kconfig               |   7 +
 arch/x86/include/asm/percpu.h  | 237 ++++++++++++++++++++++++++++-----
 arch/x86/include/asm/preempt.h |   2 +-
 3 files changed, 209 insertions(+), 37 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip
  2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
@ 2023-10-01 13:14 ` Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 2/4] x86/percpu: Detect compiler support for named address spaces Uros Bizjak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 13:14 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

This is just a convenience patch that brings the current mainline version
of arch/x86/include/asm/percpu.h up to the version in the current tip tree.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h | 110 ++++++++++++++++++++++++++++++++--
 1 file changed, 104 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 34734d730463..20624b80f890 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -210,6 +210,25 @@ do {									\
 	(typeof(_var))(unsigned long) pco_old__;			\
 })
 
+#define percpu_try_cmpxchg_op(size, qual, _var, _ovalp, _nval)		\
+({									\
+	bool success;							\
+	__pcpu_type_##size *pco_oval__ = (__pcpu_type_##size *)(_ovalp); \
+	__pcpu_type_##size pco_old__ = *pco_oval__;			\
+	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
+	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+				    __percpu_arg([var]))		\
+		  CC_SET(z)						\
+		  : CC_OUT(z) (success),				\
+		    [oval] "+a" (pco_old__),				\
+		    [var] "+m" (_var)					\
+		  : [nval] __pcpu_reg_##size(, pco_new__)		\
+		  : "memory");						\
+	if (unlikely(!success))						\
+		*pco_oval__ = pco_old__;				\
+	likely(success);						\
+})
+
 #if defined(CONFIG_X86_32) && !defined(CONFIG_UML)
 #define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval)		\
 ({									\
@@ -223,26 +242,63 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  : [var] "+m" (_var),					\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "esi");					\
+		    "c" (new__.high),					\
+		    "S" (&(_var))					\
+		  : "memory");						\
 									\
 	old__.var;							\
 })
 
 #define raw_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg64_op(8,         , pcp, oval, nval)
 #define this_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
+
+#define percpu_try_cmpxchg64_op(size, qual, _var, _ovalp, _nval)	\
+({									\
+	bool success;							\
+	u64 *_oval = (u64 *)(_ovalp);					\
+	union {								\
+		u64 var;						\
+		struct {						\
+			u32 low, high;					\
+		};							\
+	} old__, new__;							\
+									\
+	old__.var = *_oval;						\
+	new__.var = _nval;						\
+									\
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
+			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		  CC_SET(z)						\
+		  : CC_OUT(z) (success),				\
+		    [var] "+m" (_var),					\
+		    "+a" (old__.low),					\
+		    "+d" (old__.high)					\
+		  : "b" (new__.low),					\
+		    "c" (new__.high),					\
+		    "S" (&(_var))					\
+		  : "memory");						\
+	if (unlikely(!success))						\
+		*_oval = old__.var;					\
+	likely(success);						\
+})
+
+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval)		percpu_try_cmpxchg64_op(8,         , pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval)	percpu_try_cmpxchg64_op(8, volatile, pcp, ovalp, nval)
 #endif
 
 #ifdef CONFIG_X86_64
 #define raw_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg_op(8,         , pcp, oval, nval);
 #define this_cpu_cmpxchg64(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval);
 
+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval)		percpu_try_cmpxchg_op(8,         , pcp, ovalp, nval);
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval)	percpu_try_cmpxchg_op(8, volatile, pcp, ovalp, nval);
+
 #define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval)		\
 ({									\
 	union {								\
@@ -255,20 +311,54 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  : [var] "+m" (_var),					\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
-		    "c" (new__.high)					\
-		  : "memory", "rsi");					\
+		    "c" (new__.high),					\
+		    "S" (&(_var))					\
+		  : "memory");						\
 									\
 	old__.var;							\
 })
 
 #define raw_cpu_cmpxchg128(pcp, oval, nval)	percpu_cmpxchg128_op(16,         , pcp, oval, nval)
 #define this_cpu_cmpxchg128(pcp, oval, nval)	percpu_cmpxchg128_op(16, volatile, pcp, oval, nval)
+
+#define percpu_try_cmpxchg128_op(size, qual, _var, _ovalp, _nval)	\
+({									\
+	bool success;							\
+	u128 *_oval = (u128 *)(_ovalp);					\
+	union {								\
+		u128 var;						\
+		struct {						\
+			u64 low, high;					\
+		};							\
+	} old__, new__;							\
+									\
+	old__.var = *_oval;						\
+	new__.var = _nval;						\
+									\
+	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
+			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		  CC_SET(z)						\
+		  : CC_OUT(z) (success),				\
+		    [var] "+m" (_var),					\
+		    "+a" (old__.low),					\
+		    "+d" (old__.high)					\
+		  : "b" (new__.low),					\
+		    "c" (new__.high),					\
+		    "S" (&(_var))					\
+		  : "memory");						\
+	if (unlikely(!success))						\
+		*_oval = old__.var;					\
+	likely(success);						\
+})
+
+#define raw_cpu_try_cmpxchg128(pcp, ovalp, nval)	percpu_try_cmpxchg128_op(16,         , pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg128(pcp, ovalp, nval)	percpu_try_cmpxchg128_op(16, volatile, pcp, ovalp, nval)
 #endif
 
 /*
@@ -343,6 +433,9 @@ do {									\
 #define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
 #define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
 #define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_try_cmpxchg_1(pcp, ovalp, nval)	percpu_try_cmpxchg_op(1, , pcp, ovalp, nval)
+#define raw_cpu_try_cmpxchg_2(pcp, ovalp, nval)	percpu_try_cmpxchg_op(2, , pcp, ovalp, nval)
+#define raw_cpu_try_cmpxchg_4(pcp, ovalp, nval)	percpu_try_cmpxchg_op(4, , pcp, ovalp, nval)
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
@@ -350,6 +443,9 @@ do {									\
 #define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
 #define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
 #define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_try_cmpxchg_1(pcp, ovalp, nval)	percpu_try_cmpxchg_op(1, volatile, pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg_2(pcp, ovalp, nval)	percpu_try_cmpxchg_op(2, volatile, pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg_4(pcp, ovalp, nval)	percpu_try_cmpxchg_op(4, volatile, pcp, ovalp, nval)
 
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
@@ -364,6 +460,7 @@ do {									\
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
 #define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval)	percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)
 
 #define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
 #define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
@@ -373,6 +470,7 @@ do {									\
 #define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
 #define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_try_cmpxchg_8(pcp, ovalp, nval)	percpu_try_cmpxchg_op(8, volatile, pcp, ovalp, nval)
 #endif
 
 static __always_inline bool x86_this_cpu_constant_test_bit(unsigned int nr,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/4] x86/percpu: Detect compiler support for named address spaces
  2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip Uros Bizjak
@ 2023-10-01 13:14 ` Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 3/4] x86/percpu: Use compiler segment prefix qualifier Uros Bizjak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 13:14 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

Detect compiler support for named address spaces by trying
to compile a small testcase that uses the __seg_fs and __seg_gs
keywords. Reflect successful compilation in CC_HAS_NAMED_AS
and set USE_X86_SEG_SUPPORT to signal when segment qualifiers
should be used.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/Kconfig | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..f66fa5bf1f78 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2388,6 +2388,13 @@ source "kernel/livepatch/Kconfig"
 
 endmenu
 
+config CC_HAS_NAMED_AS
+	def_bool $(success,echo 'int __seg_fs fs; int __seg_gs gs;' | $(CC) -x c - -c -o /dev/null)
+
+config USE_X86_SEG_SUPPORT
+	def_bool y
+	depends on CC_HAS_NAMED_AS && SMP
+
 config CC_HAS_SLS
 	def_bool $(cc-option,-mharden-sls=all)
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/4] x86/percpu: Use compiler segment prefix qualifier
  2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 2/4] x86/percpu: Detect compiler support for named address spaces Uros Bizjak
@ 2023-10-01 13:14 ` Uros Bizjak
  2023-10-01 13:14 ` [RFC PATCH 4/4] x86/percpu: Use C for percpu read/write accessors Uros Bizjak
  2023-10-01 17:07 ` [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Linus Torvalds
  4 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 13:14 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

From: Nadav Amit <namit@vmware.com>

Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different from [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which is needed
to use typeof() to create local instances of a per-cpu variable. For
this reason, do not use the segment qualifier in per-cpu variable
declarations; instead, add the qualifier by casting at the point of access.
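
A short illustrative sketch of the problem and of the workaround
(simplified; the actual macros added below use __percpu_seg_override
rather than a hard-coded __seg_gs):

__seg_gs int var;

/* typeof(var) is "__seg_gs int" -- the qualifier cannot be stripped,
 * so typeof() cannot be used to declare a plain local copy of var. */

/* Workaround: keep the per-cpu variable itself unqualified and attach
 * the qualifier only at the point of access, via a cast: */
#define __my_cpu_type(var)	typeof(var) __seg_gs
#define __my_cpu_ptr(ptr)	(__my_cpu_type(*(ptr)) *)(uintptr_t)(ptr)
#define __my_cpu_var(var)	(*__my_cpu_ptr(&(var)))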

Uros: Improve compiler support detection and update the patch
to the current mainline.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h  | 68 +++++++++++++++++++++++-----------
 arch/x86/include/asm/preempt.h |  2 +-
 2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 20624b80f890..da451202a1b9 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -28,26 +28,50 @@
 #include <linux/stringify.h>
 
 #ifdef CONFIG_SMP
+
+#ifdef CONFIG_CC_HAS_NAMED_AS
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override	__seg_gs
+#else
+#define __percpu_seg_override	__seg_fs
+#endif
+
+#define __percpu_prefix		""
+
+#else /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __percpu_seg_override
 #define __percpu_prefix		"%%"__stringify(__percpu_seg)":"
+
+#endif /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
-#define arch_raw_cpu_ptr(ptr)				\
-({							\
-	unsigned long tcp_ptr__;			\
-	asm ("add " __percpu_arg(1) ", %0"		\
-	     : "=r" (tcp_ptr__)				\
-	     : "m" (this_cpu_off), "0" (ptr));		\
-	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
+#define arch_raw_cpu_ptr(ptr)					\
+({								\
+	unsigned long tcp_ptr__;				\
+	asm ("add " __percpu_arg(1) ", %0"			\
+	     : "=r" (tcp_ptr__)					\
+	     : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));	\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
 #define __percpu_prefix		""
-#endif
+#define __force_percpu_prefix	""
+#endif /* CONFIG_SMP */
 
+#define __my_cpu_type(var)	typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var)	(*__my_cpu_ptr(&var))
 #define __percpu_arg(x)		__percpu_prefix "%" #x
+#define __force_percpu_arg(x)	__force_percpu_prefix "%" #x
 
 /*
  * Initialized pointers to per-cpu variables needed for the boot
@@ -107,14 +131,14 @@ do {									\
 		(void)pto_tmp__;					\
 	}								\
 	asm qual(__pcpu_op2_##size(op, "%[val]", __percpu_arg([var]))	\
-	    : [var] "+m" (_var)						\
+	    : [var] "+m" (__my_cpu_var(_var))				\
 	    : [val] __pcpu_reg_imm_##size(pto_val__));			\
 } while (0)
 
 #define percpu_unary_op(size, qual, op, _var)				\
 ({									\
 	asm qual (__pcpu_op1_##size(op, __percpu_arg([var]))		\
-	    : [var] "+m" (_var));					\
+	    : [var] "+m" (__my_cpu_var(_var)));				\
 })
 
 /*
@@ -144,14 +168,14 @@ do {									\
 	__pcpu_type_##size pfo_val__;					\
 	asm qual (__pcpu_op2_##size(op, __percpu_arg([var]), "%[val]")	\
 	    : [val] __pcpu_reg_##size("=", pfo_val__)			\
-	    : [var] "m" (_var));					\
+	    : [var] "m" (__my_cpu_var(_var)));				\
 	(typeof(_var))(unsigned long) pfo_val__;			\
 })
 
 #define percpu_stable_op(size, op, _var)				\
 ({									\
 	__pcpu_type_##size pfo_val__;					\
-	asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]")	\
+	asm(__pcpu_op2_##size(op, __force_percpu_arg(P[var]), "%[val]")	\
 	    : [val] __pcpu_reg_##size("=", pfo_val__)			\
 	    : [var] "p" (&(_var)));					\
 	(typeof(_var))(unsigned long) pfo_val__;			\
@@ -166,7 +190,7 @@ do {									\
 	asm qual (__pcpu_op2_##size("xadd", "%[tmp]",			\
 				     __percpu_arg([var]))		\
 		  : [tmp] __pcpu_reg_##size("+", paro_tmp__),		\
-		    [var] "+m" (_var)					\
+		    [var] "+m" (__my_cpu_var(_var))			\
 		  : : "memory");					\
 	(typeof(_var))(unsigned long) (paro_tmp__ + _val);		\
 })
@@ -187,7 +211,7 @@ do {									\
 				    __percpu_arg([var]))		\
 		  "\n\tjnz 1b"						\
 		  : [oval] "=&a" (pxo_old__),				\
-		    [var] "+m" (_var)					\
+		    [var] "+m" (__my_cpu_var(_var))			\
 		  : [nval] __pcpu_reg_##size(, pxo_new__)		\
 		  : "memory");						\
 	(typeof(_var))(unsigned long) pxo_old__;			\
@@ -204,7 +228,7 @@ do {									\
 	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
-		    [var] "+m" (_var)					\
+		    [var] "+m" (__my_cpu_var(_var))			\
 		  : [nval] __pcpu_reg_##size(, pco_new__)		\
 		  : "memory");						\
 	(typeof(_var))(unsigned long) pco_old__;			\
@@ -221,7 +245,7 @@ do {									\
 		  CC_SET(z)						\
 		  : CC_OUT(z) (success),				\
 		    [oval] "+a" (pco_old__),				\
-		    [var] "+m" (_var)					\
+		    [var] "+m" (__my_cpu_var(_var))			\
 		  : [nval] __pcpu_reg_##size(, pco_new__)		\
 		  : "memory");						\
 	if (unlikely(!success))						\
@@ -244,7 +268,7 @@ do {									\
 									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  : [var] "+m" (_var),					\
+		  : [var] "+m" (__my_cpu_var(_var)),			\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
@@ -276,7 +300,7 @@ do {									\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  CC_SET(z)						\
 		  : CC_OUT(z) (success),				\
-		    [var] "+m" (_var),					\
+		    [var] "+m" (__my_cpu_var(_var)),			\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
@@ -313,7 +337,7 @@ do {									\
 									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  : [var] "+m" (_var),					\
+		  : [var] "+m" (__my_cpu_var(_var)),			\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
@@ -345,7 +369,7 @@ do {									\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  CC_SET(z)						\
 		  : CC_OUT(z) (success),				\
-		    [var] "+m" (_var),					\
+		    [var] "+m" (__my_cpu_var(_var)),			\
 		    "+a" (old__.low),					\
 		    "+d" (old__.high)					\
 		  : "b" (new__.low),					\
@@ -494,7 +518,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr,
 	asm volatile("btl "__percpu_arg(2)",%1"
 			CC_SET(c)
 			: CC_OUT(c) (oldbit)
-			: "m" (*(unsigned long __percpu *)addr), "Ir" (nr));
+			: "m" (*__my_cpu_ptr((unsigned long __percpu *)(addr))), "Ir" (nr));
 
 	return oldbit;
 }
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 2d13f25b1bd8..e25b95e7cf82 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -92,7 +92,7 @@ static __always_inline void __preempt_count_sub(int val)
  */
 static __always_inline bool __preempt_count_dec_and_test(void)
 {
-	return GEN_UNARY_RMWcc("decl", pcpu_hot.preempt_count, e,
+	return GEN_UNARY_RMWcc("decl", __my_cpu_var(pcpu_hot.preempt_count), e,
 			       __percpu_arg([var]));
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 4/4] x86/percpu: Use C for percpu read/write accessors
  2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
                   ` (2 preceding siblings ...)
  2023-10-01 13:14 ` [RFC PATCH 3/4] x86/percpu: Use compiler segment prefix qualifier Uros Bizjak
@ 2023-10-01 13:14 ` Uros Bizjak
  2023-10-01 17:07 ` [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Linus Torvalds
  4 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 13:14 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

The percpu code mostly uses inline assembly. Using segment qualifiers
allows C code to be used instead, which enables the compiler to perform
various optimizations (e.g. propagation of memory arguments). Convert
the percpu read and write accessors to C code, so the memory argument
can be propagated to the instruction that uses it.

Some examples of propagations:

a) into sign/zero extensions:

 110b54:       65 0f b6 05 00 00 00    movzbl %gs:0x0(%rip),%eax
 11ab90:       65 0f b6 15 00 00 00    movzbl %gs:0x0(%rip),%edx
 14484a:       65 0f b7 35 00 00 00    movzwl %gs:0x0(%rip),%esi
 1a08a9:       65 0f b6 43 78          movzbl %gs:0x78(%rbx),%eax
 1a08f9:       65 0f b6 43 78          movzbl %gs:0x78(%rbx),%eax

 4ab29a:       65 48 63 15 00 00 00    movslq %gs:0x0(%rip),%rdx
 4be128:       65 4c 63 25 00 00 00    movslq %gs:0x0(%rip),%r12
 547468:       65 48 63 1f             movslq %gs:(%rdi),%rbx
 5474e7:       65 48 63 0a             movslq %gs:(%rdx),%rcx
 54d05d:       65 48 63 0d 00 00 00    movslq %gs:0x0(%rip),%rcx

b) into compares:

 b40804:       65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
 b487e8:       65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
 b6f14c:       65 f6 05 00 00 00 00    testb  $0x1,%gs:0x0(%rip)
 bac1b8:       65 f6 05 00 00 00 00    testb  $0x1,%gs:0x0(%rip)
 df2244:       65 f7 05 00 00 00 00    testl  $0xff00,%gs:0x0(%rip)

 9a7517:       65 80 3d 00 00 00 00    cmpb   $0x0,%gs:0x0(%rip)
 b282ba:       65 44 3b 35 00 00 00    cmp    %gs:0x0(%rip),%r14d
 b48f61:       65 66 83 3d 00 00 00    cmpw   $0x8,%gs:0x0(%rip)
 b493fe:       65 80 38 00             cmpb   $0x0,%gs:(%rax)
 b73867:       65 66 83 3d 00 00 00    cmpw   $0x8,%gs:0x0(%rip)

c) into other insns:

 65ec02:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 6c98ac:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 9aafaf:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx
 b45868:       65 0f 48 35 00 00 00    cmovs  %gs:0x0(%rip),%esi
 d276f8:       65 0f 44 15 00 00 00    cmove  %gs:0x0(%rip),%edx

The above propagations result in the following code size
improvements for the current mainline kernel (with the default config),
compiled with

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

   text    data     bss     dec     hex filename
25508862        4386540  808388 30703790        1d480ae vmlinux-vanilla.o
25500922        4386532  808388 30695842        1d461a2 vmlinux-new.o

The conversion of other read-modify-write instructions does not bring any
benefit; the compiler has some problems constructing RMW instructions from
generic code and easily misses some opportunities.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Co-developed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
 arch/x86/include/asm/percpu.h | 65 +++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index da451202a1b9..60ea7755c0fe 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -400,13 +400,66 @@ do {									\
 #define this_cpu_read_stable_8(pcp)	percpu_stable_op(8, "mov", pcp)
 #define this_cpu_read_stable(pcp)	__pcpu_size_call_return(this_cpu_read_stable_, pcp)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp)					\
+({									\
+	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
+})
+
+#define __raw_cpu_write(qual, pcp, val)					\
+do {									\
+	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val);	\
+} while (0)
+
+#define raw_cpu_read_1(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val)	__raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_1(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val)	__raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_8(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#endif
+
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #define raw_cpu_read_1(pcp)		percpu_from_op(1, , "mov", pcp)
 #define raw_cpu_read_2(pcp)		percpu_from_op(2, , "mov", pcp)
 #define raw_cpu_read_4(pcp)		percpu_from_op(4, , "mov", pcp)
-
 #define raw_cpu_write_1(pcp, val)	percpu_to_op(1, , "mov", (pcp), val)
 #define raw_cpu_write_2(pcp, val)	percpu_to_op(2, , "mov", (pcp), val)
 #define raw_cpu_write_4(pcp, val)	percpu_to_op(4, , "mov", (pcp), val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op(1, volatile, "mov", pcp)
+#define this_cpu_read_2(pcp)		percpu_from_op(2, volatile, "mov", pcp)
+#define this_cpu_read_4(pcp)		percpu_from_op(4, volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val)	percpu_to_op(1, volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op(2, volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op(4, volatile, "mov", (pcp), val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp)		percpu_from_op(8, , "mov", pcp)
+#define raw_cpu_write_8(pcp, val)	percpu_to_op(8, , "mov", (pcp), val)
+
+#define this_cpu_read_8(pcp)		percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)	percpu_to_op(8, volatile, "mov", (pcp), val)
+#endif
+
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #define raw_cpu_add_1(pcp, val)		percpu_add_op(1, , (pcp), val)
 #define raw_cpu_add_2(pcp, val)		percpu_add_op(2, , (pcp), val)
 #define raw_cpu_add_4(pcp, val)		percpu_add_op(4, , (pcp), val)
@@ -432,12 +485,6 @@ do {									\
 #define raw_cpu_xchg_2(pcp, val)	raw_percpu_xchg_op(pcp, val)
 #define raw_cpu_xchg_4(pcp, val)	raw_percpu_xchg_op(pcp, val)
 
-#define this_cpu_read_1(pcp)		percpu_from_op(1, volatile, "mov", pcp)
-#define this_cpu_read_2(pcp)		percpu_from_op(2, volatile, "mov", pcp)
-#define this_cpu_read_4(pcp)		percpu_from_op(4, volatile, "mov", pcp)
-#define this_cpu_write_1(pcp, val)	percpu_to_op(1, volatile, "mov", (pcp), val)
-#define this_cpu_write_2(pcp, val)	percpu_to_op(2, volatile, "mov", (pcp), val)
-#define this_cpu_write_4(pcp, val)	percpu_to_op(4, volatile, "mov", (pcp), val)
 #define this_cpu_add_1(pcp, val)	percpu_add_op(1, volatile, (pcp), val)
 #define this_cpu_add_2(pcp, val)	percpu_add_op(2, volatile, (pcp), val)
 #define this_cpu_add_4(pcp, val)	percpu_add_op(4, volatile, (pcp), val)
@@ -476,8 +523,6 @@ do {									\
  * 32 bit must fall back to generic operations.
  */
 #ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp)			percpu_from_op(8, , "mov", pcp)
-#define raw_cpu_write_8(pcp, val)		percpu_to_op(8, , "mov", (pcp), val)
 #define raw_cpu_add_8(pcp, val)			percpu_add_op(8, , (pcp), val)
 #define raw_cpu_and_8(pcp, val)			percpu_to_op(8, , "and", (pcp), val)
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
@@ -486,8 +531,6 @@ do {									\
 #define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
 #define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval)	percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
 #define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
 #define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
 #define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
                   ` (3 preceding siblings ...)
  2023-10-01 13:14 ` [RFC PATCH 4/4] x86/percpu: Use C for percpu read/write accessors Uros Bizjak
@ 2023-10-01 17:07 ` Linus Torvalds
  2023-10-01 19:53   ` Uros Bizjak
  4 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2023-10-01 17:07 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, 1 Oct 2023 at 06:16, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> This patchset resurrect the work of Richard Henderson [1] and Nadav Amit [2]
> to introduce named address spaces compiler extension [3,4] into the linux kernel.

So apparently the extension has been around for a while (since gcc-6),
but I don't actually find any single major _user_ of it.

Debian code search finds exactly one case of it outside of the
compilers themselves:

  #  define THREAD_SELF \
    (*(struct pthread *__seg_fs *) offsetof (struct pthread, header.self))

in glibc/sysdeps/x86_64/nptl/tls.h, and even that isn't very widely
used (it seems to be a pthread_mutex implementation helper).

So the coverage testing of this thing seems very weak. Do we have any
other reason to believe that this is likely to actually be reliable
enough to use?

                     Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 17:07 ` [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Linus Torvalds
@ 2023-10-01 19:53   ` Uros Bizjak
  2023-10-01 20:21     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, Oct 1, 2023 at 7:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 1 Oct 2023 at 06:16, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > This patchset resurrect the work of Richard Henderson [1] and Nadav Amit [2]
> > to introduce named address spaces compiler extension [3,4] into the linux kernel.
>
> So apparently the extension has been around for a while (since gcc-6),
> but I don't actually find any single major _user_ of it.
>
> Debian code search finds exactly one case of it outside of the
> compilers themselves:
>
>   #  define THREAD_SELF \
>     (*(struct pthread *__seg_fs *) offsetof (struct pthread, header.self))
>
> in glibc/sysdeps/x86_64/nptl/tls.h, and even that isn't very widely
> used (it seems to be a pthread_mutex implementation helper).
>
> So the coverage testing of this thing seems very weak. Do we have any
> other reason to believe that this is likely to actually be reliable
> enough to use?

The clang manual nicely summarises the named address space extension with:

"Note that this is a very very low-level feature that should only be
used if you know what you’re doing (for example in an OS kernel)."

But, at least in GCC, the middle-end code handles many targets:
grepping for MEM_ADDR_SPACE in the config directories returns the avr,
ft32, gcn, h8300, i386, m32c, m68k, mn10300, msp430, pru, riscv, rl78,
rs6000, sh and vax targets. This extension is quite popular with
embedded targets.

Regarding x86 target specific code, the same functionality used for
explicit address spaces is used internally to handle the __thread
qualifier. The testcase:

__thread int m;
int foo (void) { return m; }

compiles the memory read to:

#(insn:TI 10 2 11 2 (set (reg/i:SI 0 ax)
#        (mem/c:SI (const:DI (unspec:DI [
#                        (symbol_ref:DI ("m") [flags 0x2a] <var_decl
0x7f4a5a811bd0 m>)
#                    ] UNSPEC_NTPOFF)) [1 m+0 S4 A32 AS1]))
"thread.c":3:28 81 {*movsi_internal}
#     (nil))
       movl    %fs:m@tpoff, %eax       # 10    [c=5 l=8]  *movsi_internal/0

where AS1 in memory flags represents address space 1 (%fs: prefix).

Also, stack protector internally uses the same target specific code:

#(insn:TI 6 9 10 2 (parallel [
#            (set (mem/v/f/c:DI (plus:DI (reg/f:DI 7 sp)
#                        (const_int 8 [0x8])) [4 D.2009+0 S8 A64])
#                (unspec:DI [
#                        (mem/v/f:DI (const_int 40 [0x28]) [5
MEM[(<address-space-1> long unsigned int *)40B]+0 S8 A64 AS1])
#                    ] UNSPEC_SP_SET))
#            (set (reg:DI 0 ax [92])
#                (const_int 0 [0]))
#            (clobber (reg:CC 17 flags))
#        ]) "pr111023.c":12:1 1265 {stack_protect_set_1_di}
#     (expr_list:REG_UNUSED (reg:CC 17 flags)
#        (expr_list:REG_UNUSED (reg:DI 0 ax [92])
#            (nil))))
       movq    %fs:40, %rax    # 6     [c=0 l=16]  stack_protect_set_1_di

Again, AS1 in memory flags defines address space 1 (%fs prefix).

Compare this to the following testcase that explicitly defines __seg_gs:

__seg_gs int m;
int foo (void) { return m; }

The testcase compiles the read from memory to:

#(insn:TI 10 2 11 2 (set (reg/i:SI 0 ax)
#        (mem/c:SI (symbol_ref:DI ("m") [flags 0x2] <var_decl
0x7f89fe611bd0 m>) [1 m+0 S4 A32 AS2])) "thread.c":3:28 81
{*movsi_internal}
#     (nil))
       movl    %gs:m(%rip), %eax       # 10    [c=5 l=7]  *movsi_internal/0

Please note AS2 in the memory flags; this represents address space 2
(the %gs: prefix). Otherwise, the memory reference is handled by the
compiler like any other memory reference.

As demonstrated above, the compiler handles memory references with an
explicit address space through the same middle-end and target-dependent
code as __thread and stack protector memory references. __thread and
the stack protector are heavily used features of the compiler, so I'm
quite confident that explicit address spaces should work as advertised.
Even *if* there are some issues with aliasing, the kernel is immune to
them due to

KBUILD_CFLAGS += -fno-strict-aliasing

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 19:53   ` Uros Bizjak
@ 2023-10-01 20:21     ` Linus Torvalds
  2023-10-01 20:30       ` Linus Torvalds
  2023-10-01 21:47       ` Uros Bizjak
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2023-10-01 20:21 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, 1 Oct 2023 at 12:53, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Regarding x86 target specific code, the same functionality used for
> explicit address space is used internally to handle __thread
> qualifier.

Ok, that's interesting, in that __thread is certainly widely used so
it will have seen testing.

> Even *if* there are some issues with aliasing, the kernel
> is immune to them due to
>
> KBUILD_CFLAGS += -fno-strict-aliasing

It's not aliasing I'd worry about. It's correctness.

And indeed, the *very* first thing I tried shows that this is all very
very buggy in gcc.

What did I try? A simple memory copy with a structure assignment.

Try to compile this:

    #include <string.h>
    struct a { long arr[30]; };

    __seg_fs struct a m;
    void foo(struct a *dst) { *dst = m; }

using the kernel compiler options (it's the "don't use sse/avx" ones
that matter):

    gcc -mno-avx -mno-sse -O2 -S t.c

and look at the end result. It's complete and utter sh*t:

        foo:
                xorl    %eax, %eax
                cmpq    $240, %rax
                jnb     .L5
        .L2:
                movzbl  %fs:m(%rax), %edx
                movb    %dl, (%rdi,%rax)
                addq    $1, %rax
                cmpq    $240, %rax
                jb      .L2
        .L5:
                ret

to the point that I can only go "WTF"?

I mean, it's not just that it does the copy one byte at a time. It
literally compares %rax to $240 just after it has cleared it. I look
at that code, and I go "a five-year old with a crayon could have done
better".

In other words, no, we're not using this thing that generates that
kind of garbage.

Somebody needs to open a bugzilla entry for this kind of code generation.

Clang isn't much better, but at least it doesn't generate bad code. It
just crashes with an internal compiler error on the above trivial
test-case:

    fatal error: error in backend: cannot lower memory intrinsic in
address space 257

which at least tells the user that they can't copy memory from that
address space. But once again shows that no, this feature is not ready
for prime-time.

If literally the *first* thing I thought to test was this broken, what
else is broken in this model?

And no, the kernel doesn't currently do the above kinds of things.
That's not the point. The point was "how well is this compiler support
tested". The answer is "not at all".

                   Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 20:21     ` Linus Torvalds
@ 2023-10-01 20:30       ` Linus Torvalds
  2023-10-01 21:47       ` Uros Bizjak
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2023-10-01 20:30 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, 1 Oct 2023 at 13:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 1 Oct 2023 at 12:53, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Regarding x86 target specific code, the same functionality used for
> > explicit address space is used internally to handle __thread
> > qualifier.
>
> Ok, that's interesting, in that __thread is certainly widely used so
> it will have seen testing.

.. but I just checked that the __thread case *does* work with my
stupid test-case, so clearly the "__thread" coverage ends up being
very different from something like __seg_fs.

The difference? For __thread, gcc and clang know how to get the
beginning of the thread area (the equivalent of our kernel
this_cpu_ptr() macro), so now the compiler knows how to turn a
__thread pointer into a "normal" pointer, and can just do memcpy.

But for __seg_fs and __seg_gs, the compiler doesn't know how to do
that, and then ends up just flailing wildly.

If the structure is small enough to be done as individual moves, both
gcc and clang do ok. But anything else is just a complete shit-show.

If they both errored out reliably and legibly, that would be one
thing. But gcc just silently generates garbage, and clang errors out
with 60+ lines of internal compiler backtrace.

             Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 20:21     ` Linus Torvalds
  2023-10-01 20:30       ` Linus Torvalds
@ 2023-10-01 21:47       ` Uros Bizjak
  2023-10-02 12:13         ` Uros Bizjak
  1 sibling, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2023-10-01 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, Oct 1, 2023 at 10:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 1 Oct 2023 at 12:53, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Regarding x86 target specific code, the same functionality used for
> > explicit address space is used internally to handle __thread
> > qualifier.
>
> Ok, that's interesting, in that __thread is certainly widely used so
> it will have seen testing.
>
> > Even *if* there are some issues with aliasing, the kernel
> > is immune to them due to
> >
> > KBUILD_CFLAGS += -fno-strict-aliasing
>
> It's not aliasing I'd worry about. It's correctness.
>
> And indeed, the *very* first thing I tried shows that this is all very
> very buggy in gcc.
>
> What did I try? A simple memory copy with a structure assignment.
>
> Try to compile this:
>
>     #include <string.h>
>     struct a { long arr[30]; };
>
>     __seg_fs struct a m;
>     void foo(struct a *dst) { *dst = m; }
>
> using the kernel compiler options (it's the "don't use sse/avx" ones
> that matter):
>
>     gcc -mno-avx -mno-sse -O2 -S t.c
>
> and look at the end result. It's complete and utter sh*t:
>
>         foo:
>                 xorl    %eax, %eax
>                 cmpq    $240, %rax
>                 jnb     .L5
>         .L2:
>                 movzbl  %fs:m(%rax), %edx
>                 movb    %dl, (%rdi,%rax)
>                 addq    $1, %rax
>                 cmpq    $240, %rax
>                 jb      .L2
>         .L5:
>                 ret
>
> to the point that I can only go "WTF"?
>
> I mean, it's not just that it does the copy one byte at a time. It
> literally compares %rax to $240 just after it has cleared it. I look
> at that code, and I go "a five-year old with a crayon could have done
> better".
>
> In other words, no, we're not using this thing that generates that
> kind of garbage.
>
> Somebody needs to open a bugzilla entry for this kind of code generation.

Huh, this testcase triggers a known issue with IVopts. I opened
PR111657 [1], but the issue with IVopts is already reported in PR79649
[2].

> Clang isn't much better, but at least it doesn't generate bad code. It
> just crashes with an internal compiler error on the above trivial
> test-case:
>
>     fatal error: error in backend: cannot lower memory intrinsic in
> address space 257
>
> which at least tells the user that they can't copy memory from that
> address space. But once again shows that no, this feature is not ready
> for prime-time.
>
> If literally the *first* thing I thought to test was this broken, what
> else is broken in this model?
>
> And no, the kernel doesn't currently do the above kinds of things.
> That's not the point. The point was "how well is this compiler support
> tested". The answer is "not at all".

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79649

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-01 21:47       ` Uros Bizjak
@ 2023-10-02 12:13         ` Uros Bizjak
  2023-10-02 13:22           ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2023-10-02 12:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Andy Lutomirski, Ingo Molnar, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Sun, Oct 1, 2023 at 11:47 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Sun, Oct 1, 2023 at 10:21 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, 1 Oct 2023 at 12:53, Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > Regarding x86 target specific code, the same functionality used for
> > > explicit address space is used internally to handle __thread
> > > qualifier.
> >
> > Ok, that's interesting, in that __thread is certainly widely used so
> > it will have seen testing.
> >
> > > Even *if* there are some issues with aliasing, the kernel
> > > is immune to them due to
> > >
> > > KBUILD_CFLAGS += -fno-strict-aliasing
> >
> > It's not aliasing I'd worry about. It's correctness.
> >
> > And indeed, the *very* first thing I tried shows that this is all very
> > very buggy in gcc.
> >
> > What did I try? A simple memory copy with a structure assignment.
> >
> > Try to compile this:
> >
> >     #include <string.h>
> >     struct a { long arr[30]; };
> >
> >     __seg_fs struct a m;
> >     void foo(struct a *dst) { *dst = m; }
> >
> > using the kernel compiler options (it's the "don't use sse/avx" ones
> > that matter):
> >
> >     gcc -mno-avx -mno-sse -O2 -S t.c
> >
> > and look at the end result. It's complete and utter sh*t:
> >
> >         foo:
> >                 xorl    %eax, %eax
> >                 cmpq    $240, %rax
> >                 jnb     .L5
> >         .L2:
> >                 movzbl  %fs:m(%rax), %edx
> >                 movb    %dl, (%rdi,%rax)
> >                 addq    $1, %rax
> >                 cmpq    $240, %rax
> >                 jb      .L2
> >         .L5:
> >                 ret
> >
> > to the point that I can only go "WTF"?
> >
> > I mean, it's not just that it does the copy one byte at a time. It
> > literally compares %rax to $240 just after it has cleared it. I look
> > at that code, and I go "a five-year old with a crayon could have done
> > better".
> >
> > In other words, no, we're not using this thing that generates that
> > kind of garbage.
> >
> > Somebody needs to open a bugzilla entry for this kind of code generation.
>
> Huh, this testcase triggers known issue with IVopts. I opened
> PR111657, but the issue with IVopts is already reported in PR79649
> [2].

Actually (or luckily), my preliminary analysis was wrong. This is just
a case of a missed optimization, where target-dependent code chose a
non-optimal (but still correct) copy algorithm under very unusual
circumstances. In GCC, a stringop can be implemented using one of
several (8) algorithms:

DEF_ALG (libcall, libcall)
DEF_ALG (rep_prefix_1_byte, rep_byte)
DEF_ALG (rep_prefix_4_byte, rep_4byte)
DEF_ALG (rep_prefix_8_byte, rep_8byte)
DEF_ALG (loop_1_byte, byte_loop)
DEF_ALG (loop, loop)
DEF_ALG (unrolled_loop, unrolled_loop)
DEF_ALG (vector_loop, vector_loop)

but some of them (the rep_prefix ones) cannot be used with non-default
address spaces. Obviously, vector_loop cannot be used without SSE/AVX
instructions, so what remains is a severely limited selection of
algorithms. Target processors (-mtune=...) define their own selection
of algorithms, based on the maximum copied block size. The generic
tuning selects:

static stringop_algs generic_memcpy[2] = {
  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
             {-1, libcall, false}}},
  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
             {-1, libcall, false}}}};

Now, rep_prefix_8_byte is not available with a non-default address
space, so the algorithm falls back to libcall (the one after the
unavailable algorithm). However, we can't call into libc here (the
library function also handles only the default address space), so the
target-independent part of the compiler emits the most generic one-byte
copy loop.

The "unusual circumstances" here are following:
- rep_prefix instructions are disabled, these are otherwise used by most targets
- vector loops are disabled
- libcall algo still produces correct code, with non-optimal loop instructions.

Very few users look into the produced assembly to spot the difference
between a one-byte loop (admittedly with an unwanted compare) and an
8-byte loop. Since the compiled code works as expected, there were no
bug reports for this; I would say this is a minor issue.

As shown in the bug report [1], the fix is to select the "loop"
fallback algorithm instead of libcall (the patch is effectively a
couple of lines) when a non-default address space is used. The
generated assembly for the structure copy now reads:

       xorl    %eax, %eax
.L2:
       movl    %eax, %edx
       addl    $8, %eax
       movq    %gs:m(%rdx), %rcx
       movq    %rcx, (%rdi,%rdx)
       cmpl    $240, %eax
       jb      .L2
       ret

The same assembly can be generated with the -mstringop-strategy=loop
compiler argument. This argument affects both the __seg_gs and the
__thread copy loops, so (almost) the same code will be generated for
both (because, as claimed in the previous post, the same compiler code
handles named and implicit address spaces).

> > Clang isn't much better, but at least it doesn't generate bad code. It
> > just crashes with an internal compiler error on the above trivial
> > test-case:
> >
> >     fatal error: error in backend: cannot lower memory intrinsic in
> > address space 257
> >
> > which at least tells the user that they can't copy memory from that
> > address space. But once again shows that no, this feature is not ready
> > for prime-time.
> >
> > If literally the *first* thing I thought to test was this broken, what
> > else is broken in this model?
> >
> > And no, the kernel doesn't currently do the above kinds of things.
> > That's not the point. The point was "how well is this compiler support
> > tested". The answer is "not at all".

I don't agree with the above claims. The generated code was the
product of a too limited selection of available copy algorithms under
unusual circumstances, but even in the case of the generic fallback,
the generated code was *correct*. As said in the previous post, and
re-confirmed by the patch in the PR, the same code in GCC handles
implicit (__thread) and named address spaces. At the end of the day,
the problematic code was merely a missed optimization (the class of
bug with the lowest severity in GCC).

> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657
> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79649

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-02 12:13         ` Uros Bizjak
@ 2023-10-02 13:22           ` Ingo Molnar
  2023-10-03  9:49             ` Uros Bizjak
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2023-10-02 13:22 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Linus Torvalds, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf


* Uros Bizjak <ubizjak@gmail.com> wrote:

> > > Clang isn't much better, but at least it doesn't generate bad code. It
> > > just crashes with an internal compiler error on the above trivial
> > > test-case:
> > >
> > >     fatal error: error in backend: cannot lower memory intrinsic in
> > > address space 257
> > >
> > > which at least tells the user that they can't copy memory from that
> > > address space. But once again shows that no, this feature is not ready
> > > for prime-time.
> > >
> > > If literally the *first* thing I thought to test was this broken, what
> > > else is broken in this model?
> > >
> > > And no, the kernel doesn't currently do the above kinds of things.
> > > That's not the point. The point was "how well is this compiler support
> > > tested". The answer is "not at all".
> 
> I don't agree with the above claims. The generated code was the product 
> of a too limited selection of available copy algorithms in unusual 
> circumstances, but even in the case of generic fallback code, the 
> generated code was *correct*. As said in the previous post, and 
> re-confirmed by the patch in the PR, the same code in GCC handles 
> implicit (__thread) and named address spaces. At the end of the day, the 
> problematic code was merely a missing-optimization (the bug with the 
> lowest severity in GCC).

Yeah, so the feature generated really crappy code for Linus's
testcase. That's never a good sign for compiler features, full stop.

Do we want the kernel to be at the bleeding edge of an 
unusual compiler feature that is only used internally by the
compiler in a very specific fashion?

Maybe, but Linus's reluctance and caution are justified IMHO,
and at minimum this feature needs some careful evaluation of 
its long-term suitability [*] ...

Thanks,

	Ingo


[*] euphemism for: "I have no idea how to evaluate this risk"... :-/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-02 13:22           ` Ingo Molnar
@ 2023-10-03  9:49             ` Uros Bizjak
  2023-10-03 13:38               ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2023-10-03  9:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Mon, Oct 2, 2023 at 3:22 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > > > Clang isn't much better, but at least it doesn't generate bad code. It
> > > > just crashes with an internal compiler error on the above trivial
> > > > test-case:
> > > >
> > > >     fatal error: error in backend: cannot lower memory intrinsic in
> > > > address space 257
> > > >
> > > > which at least tells the user that they can't copy memory from that
> > > > address space. But once again shows that no, this feature is not ready
> > > > for prime-time.
> > > >
> > > > If literally the *first* thing I thought to test was this broken, what
> > > > else is broken in this model?
> > > >
> > > > And no, the kernel doesn't currently do the above kinds of things.
> > > > That's not the point. The point was "how well is this compiler support
> > > > tested". The answer is "not at all".
> >
> > I don't agree with the above claims. The generated code was the product
> > of a too limited selection of available copy algorithms in unusual
> > circumstances, but even in the case of generic fallback code, the
> > generated code was *correct*. As said in the previous post, and
> > re-confirmed by the patch in the PR, the same code in GCC handles
> > implicit (__thread) and named address spaces. At the end of the day, the
> > problematic code was merely a missing-optimization (the bug with the
> > lowest severity in GCC).
>
> Yeah, so the feature generated really crappy code for Linus's
> testcase. That's never a good sign for compiler features, full stop.
>
> Do we want the kernel to be at the bleeding edge of an
> unusual compiler feature that is only used internally by the
> compiler in a very specific fashion?

I understand your reservations about the new feature, but please allow
me to bring up some facts about it. The feature is *far* from being an
unusual internal compiler feature in GCC. It was introduced for gcc-6
(mostly for embedded targets), and in 2017 the complete thread-local
storage (e.g. __thread) handling was switched to named address spaces
and released in gcc-8. I don't remember any reported problem with this
feature (and I know what I'm talking about; the switch to named AS in
the compiler was done by me [1]). Since named AS uses the same compiler
code, I'm quite confident that it shouldn't be considered a
bleeding-edge, unusual compiler feature.

The "crappy code" happened due to the limitation in the completely
different part of the compiler. It was stringop algorithm selection, a
feature orthogonal to address spaces (tangential only in the sense
that several algorithms are not available with non-default address
space). Colorful language notwithstanding, the generated code is
nothing more than "missed-optimization" and the gcc bugreport was
qualified as that. With -mstringop-strategy=loop, the testcase
produces expected code, and the stringop selection in the compiler
will be adjusted accordingly also for the very limited selection of
stringop algorithms for named AS. Even when the kernel will never use
these algorithms for its percpu functionality...

> Maybe, but Linus's reluctance and caution is justified IMHO,
> and at minimum this feature needs some careful evaluation of
> long-time suitability [*] ...

I do have a proposal on how to introduce the new feature while
minimising risk as much as possible. I must admit that detecting the
feature for all released compilers that can handle __seg_gs seems
quite risky (I have to curb my enthusiasm somehow ;) ), so perhaps a
staged approach with a currently unreleased compiler is more
appropriate. Using:

+config CC_HAS_NAMED_AS
+       def_bool CC_IS_GCC && GCC_VERSION >= 140000

would enable the new feature only for the currently unreleased,
experimental GCC. This would allow us to qualify the compiler and weed
out any possible problems before the compiler is released. Please note
that the patch is written in such a way that the code reverts to the
existing approach when CC_HAS_NAMED_AS is undefined.

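To illustrate what "reverts to the existing approach" means, here is a
sketch only (not the actual patch; the macro name is made up for the
example), showing how the accessors can key off the new config symbol
while the asm-based accessors remain the default:

#ifdef CONFIG_CC_HAS_NAMED_AS
# define __my_percpu_seg        __seg_gs        /* compiler-visible segment qualifier (64-bit) */
#else
# define __my_percpu_seg                        /* keep the existing asm-based accessors */
#endif
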
I would also like to point out additional possible improvements that
are not in the proposed patch. Using named AS brings the kernel one
step closer to PIE, as noted in the original patch
submission [2].

> [*] euphemism for: "I have no idea how to evaluate this risk"... :-/

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=cfc72af0fbdf
[2] https://lore.kernel.org/lkml/20190823224424.15296-1-namit@vmware.com/

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-03  9:49             ` Uros Bizjak
@ 2023-10-03 13:38               ` Ingo Molnar
  2023-10-03 19:31                 ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2023-10-03 13:38 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Linus Torvalds, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf


* Uros Bizjak <ubizjak@gmail.com> wrote:

> > Maybe, but Linus's reluctance and caution is justified IMHO, and at 
> > minimum this feature needs some careful evaluation of long-time 
> > suitability [*] ...
> 
> I do have a proposal on how to introduce a new feature while minimising 
> risk as much as possible. I must admit that detecting the feature for all 
> released compilers that can handle __seg_gs seems quite risky (I have to 
> curb my enthusiasm somehow ;) ), so perhaps a staged approach with a 
> currently unreleased compiler is more appropriate. Using:
> 
> +config CC_HAS_NAMED_AS
> +       def_bool CC_IS_GCC && GCC_VERSION >= 140000
> 
> would enable the new feature only for currently unreleased experimental 
> GCC. This would allow to qualify the compiler and weed out any possible 
> problems, before the compiler is released. Please note, that the patch is 
> written in such a way, that the code reverts to the existing approach for 
> undefined CC_HAS_NAMED_AS.

So I don't think it's a good idea to restrict it to the devel GCC version 
only: the intersection of devel-GCC and devel-kernel reduces testing 
coverage to near-zero in practice ...

Instead - if Linus doesn't disagree, that is - how about restricting it to 
the freshest GCC release, i.e. GCC 13.1 and later? We could carry it in a 
separate branch in -tip for a while and not merge it into v6.7 but into 
v6.8 or so.

That would at least give us *some* amount of test coverage, without any 
real upstream risk.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-03 13:38               ` Ingo Molnar
@ 2023-10-03 19:31                 ` Linus Torvalds
  2023-10-03 19:43                   ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2023-10-03 19:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Tue, 3 Oct 2023 at 06:38, Ingo Molnar <mingo@kernel.org> wrote:
>
> So I don't think it's a good idea to restrict it to the devel GCC version
> only, the cross-section of devel-GCC and devel-kernel reduces testing
> coverage to near-zero in practice ...

In fact, while the clang failure was arguably worse from a code
generation standpoint (as in "it didn't generate any code AT ALL"), it
was actually better from a kernel standpoint: I'd *much* rather have a
compile-time failure than bad code generation when it's a particular
issue that we can avoid by just not doing it.

IOW, *if* this is the only actual issue with named address spaces,
then I'd much rather have a compiler that says "don't do that" over a
compiler that silently generates absolutely horrendous code.

That is not unlike my "I'd rather get a link time error from trying to
do a 64-by-64 divide on x86-32, than have the compiler actually
generate that horrendously expensive operation". There's a reason we
have "do_div64()" to do 64-by-32 divides, because that's usually what
you actually want.

We should not be doing big structure copies from or to the percpu
area, so clang then failing with an admittedly horrendous error
message is not a bad thing.

And again - my worry really isn't this "copy a percpu structure"
issue. It's literally just that I feel this doesn't have a lot of
coverage.

              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-03 19:31                 ` Linus Torvalds
@ 2023-10-03 19:43                   ` Ingo Molnar
  2023-10-03 21:43                     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2023-10-03 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Uros Bizjak, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 3 Oct 2023 at 06:38, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So I don't think it's a good idea to restrict it to the devel GCC version
> > only, the cross-section of devel-GCC and devel-kernel reduces testing
> > coverage to near-zero in practice ...
> 
> In fact, while the clang failure was arguably worse from a code
> generation standpoint (as in "it didn't generate any code AT ALL"), it
> was actually better from a kernel standpoint: I'd *much* rather have a
> compile-time failure than bad code generation when it's a particular
> issue that we can avoid by just not doing it.
> 
> IOW, *if* this is the only actual issue with named address spaces,
> then I'd much rather have a compiler that says "don't do that" over a
> compiler that silently generates absolutely horrendous code.
> 
> That is not unlike my "I'd rather get a link time error from trying to
> do a 64-by-64 divide on x86-32, than have the compiler actually
> generate that horrendously expensive operation". There's a reason we
> have "do_div64()" to do 64-by-32 divides, because that's usually what
> you actually want.
> 
> We should not be doing big structure copies from or to the percpu
> area, so clang then failing with an admittedly horrendous error
> message is not a bad thing.
> 
> And again - my worry really isn't this "copy a percpu structure"
> issue. It's literally just that I feel this doesn't have a lot of
> coverage.

I share all those concerns.

So we could do this: we let it live in -tip for a cycle, in a separate
branch, and observe what happens - it gets picked up by -next on
a daily basis and most x86 developers test it. It won't be merged by other
branches in -tip, it won't be pulled by others or relied on. If it
conflicts with other bits we rebase it cleanly, no questions asked.

While -next test coverage is still limited in many ways, it's also
certainly not zero.

If it's problem-free for a cycle I'll offer it up to you as an RFC pull,
summarizing our experience with it. (Should it ever get to that point.)

That's the best I think we can do - and worst-case we'll turn it off
again and go curse flaky compiler features. Will be easy to turn off
if it's compiler version triggered anyway.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-03 19:43                   ` Ingo Molnar
@ 2023-10-03 21:43                     ` Linus Torvalds
  2023-10-04  6:01                       ` Uros Bizjak
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2023-10-03 21:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Tue, 3 Oct 2023 at 12:43, Ingo Molnar <mingo@kernel.org> wrote:
>
> So we could do this: we let it live in -tip for a cycle, in a separate
> branch, and observe what happens - it gets picked up by -next on
> a daily basis and most x86 developers test it. It won't be merged by other
> branches in -tip, it won't be pulled by others or relied on. If it
> conflicts with other bits we rebase it cleanly, no questions asked.

Sounds like a plan,

              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] x86/percpu: Use segment qualifiers
  2023-10-03 21:43                     ` Linus Torvalds
@ 2023-10-04  6:01                       ` Uros Bizjak
  0 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2023-10-04  6:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, x86, linux-kernel, Andy Lutomirski, Nadav Amit,
	Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Josh Poimboeuf

On Tue, Oct 3, 2023 at 11:44 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 3 Oct 2023 at 12:43, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So we could do this: we let it live in -tip for a cycle, in a separate
> > branch, and observe what happens - it gets picked up by -next on
> > a daily basis and most x86 developers test it. It won't be merged by other
> > branches in -tip, it won't be pulled by others or relied on. If it
> > conflicts with other bits we rebase it cleanly, no questions asked.
>
> Sounds like a plan,

Thanks!

Will send a patchset with the config change later today.

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-10-04  6:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-01 13:14 [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Uros Bizjak
2023-10-01 13:14 ` [RFC PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip Uros Bizjak
2023-10-01 13:14 ` [RFC PATCH 2/4] x86/percpu: Detect compiler support for named address spaces Uros Bizjak
2023-10-01 13:14 ` [RFC PATCH 3/4] x86/percpu: Use compiler segment prefix qualifier Uros Bizjak
2023-10-01 13:14 ` [RFC PATCH 4/4] x86/percpu: Use C for percpu read/write accessors Uros Bizjak
2023-10-01 17:07 ` [RFC PATCH 0/4] x86/percpu: Use segment qualifiers Linus Torvalds
2023-10-01 19:53   ` Uros Bizjak
2023-10-01 20:21     ` Linus Torvalds
2023-10-01 20:30       ` Linus Torvalds
2023-10-01 21:47       ` Uros Bizjak
2023-10-02 12:13         ` Uros Bizjak
2023-10-02 13:22           ` Ingo Molnar
2023-10-03  9:49             ` Uros Bizjak
2023-10-03 13:38               ` Ingo Molnar
2023-10-03 19:31                 ` Linus Torvalds
2023-10-03 19:43                   ` Ingo Molnar
2023-10-03 21:43                     ` Linus Torvalds
2023-10-04  6:01                       ` Uros Bizjak
