* [PATCH 0/7] x86/percpu: Use segment qualifiers
@ 2019-08-23 22:44 Nadav Amit
  2019-08-23 22:44 ` [PATCH 1/7] compiler: Report x86 segment support Nadav Amit
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

GCC 6+ supports x86 segment qualifiers. Using them makes it possible to
implement several optimizations (a minimal usage sketch follows the list
below):

1. Avoid unnecessary instructions when an operation is carried out on a
per-cpu value that is read or written, and instead allow the compiler to
emit instructions that access the per-cpu value directly.

2. Make this_cpu_ptr() more efficient and allow its value to be cached,
since preemption must be disabled when this_cpu_ptr() is used.

3. Provide a better alternative to this_cpu_read_stable() that caches
values more efficiently, using the alias attribute to a const variable.

4. Allow the compiler to perform other optimizations (e.g. CSE).

5. Use RIP-relative addressing in this_cpu_read_stable(), which makes it
PIE-ready.
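
For reference, a minimal sketch of what the qualifier buys (a stand-alone
example with a made-up offset, not kernel code): dereferencing a
__seg_gs-qualified pointer makes GCC emit a %gs-prefixed access directly,
without any inline assembly.

  /* build with gcc >= 6 on x86-64; 0x20 is a hypothetical per-CPU slot */
  typedef unsigned long __seg_gs gs_ulong;

  static unsigned long read_percpu_slot(void)
  {
          return *(gs_ulong *)0x20;       /* emits: mov %gs:0x20,%rax */
  }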

"size" and Peter's compare do not seem to show the impact on code size
reduction correctly. Summing the code size according to nm on a defconfig
build shows a minor reduction of roughly 0.09% (from a total of 11451310
bytes).

RFC->v1:
 * Fixing i386 build bug
 * Moving chunk to the right place [Peter]

Nadav Amit (7):
  compiler: Report x86 segment support
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu accesses when possible
  x86: Fix possible caching of current_task
  percpu: Assume preemption is disabled on per_cpu_ptr()
  x86/percpu: Optimized arch_raw_cpu_ptr()
  x86/current: Aggressive caching of current

 arch/x86/include/asm/current.h         |  30 +++
 arch/x86/include/asm/fpu/internal.h    |   7 +-
 arch/x86/include/asm/percpu.h          | 293 +++++++++++++++++++------
 arch/x86/include/asm/preempt.h         |   3 +-
 arch/x86/include/asm/resctrl_sched.h   |  14 +-
 arch/x86/kernel/cpu/Makefile           |   1 +
 arch/x86/kernel/cpu/common.c           |   7 +-
 arch/x86/kernel/cpu/current.c          |  16 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 arch/x86/kernel/process_32.c           |   4 +-
 arch/x86/kernel/process_64.c           |   4 +-
 include/asm-generic/percpu.h           |  12 +
 include/linux/compiler-gcc.h           |   4 +
 include/linux/compiler.h               |   2 +-
 include/linux/percpu-defs.h            |  33 ++-
 15 files changed, 346 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

-- 
2.17.1



* [PATCH 1/7] compiler: Report x86 segment support
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 2/7] x86/percpu: Use compiler segment prefix qualifier Nadav Amit
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

GCC v6+ supports x86 segment qualifiers (__seg_gs and __seg_fs). Define
COMPILER_HAS_X86_SEG_SUPPORT when it is supported.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler-gcc.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index d7ee4c6bad48..5967590a18c6 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -149,6 +149,10 @@
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
 
+#if GCC_VERSION >= 60000
+#define COMPILER_HAS_X86_SEG_SUPPORT 1
+#endif
+
 /*
  * Turn individual warnings and errors on and off locally, depending
  * on version.
-- 
2.17.1



* [PATCH 2/7] x86/percpu: Use compiler segment prefix qualifier
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
  2019-08-23 22:44 ` [PATCH 1/7] compiler: Report x86 segment support Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 3/7] x86/percpu: Use C for percpu accesses when possible Nadav Amit
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different from [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which would be
needed in order to use typeof() to create local instances of a per-cpu
variable. For this reason, do not attach the segment qualifier to per-cpu
variables themselves; instead, cast to a segment-qualified type at the
access site (a short sketch of this idiom follows).
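
A reduced sketch of the idiom (stand-alone, with made-up names; the real
macros are __my_cpu_type()/__my_cpu_var() in the hunk below). In user
space this only illustrates the code generation, since %gs does not point
at a per-CPU area there:

  #include <stdint.h>

  static int pcpu_var;                    /* declared without a qualifier */

  #define my_cpu_type(var) typeof(var) __seg_gs
  #define my_cpu_var(var)  (*(my_cpu_type(var) *)(uintptr_t)&(var))

  static int read_pcpu_var(void)
  {
          typeof(pcpu_var) tmp;           /* plain, unqualified local copy */

          tmp = my_cpu_var(pcpu_var);     /* %gs-relative load at the access */
          return tmp;
  }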

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/percpu.h  | 153 ++++++++++++++++++++++-----------
 arch/x86/include/asm/preempt.h |   3 +-
 2 files changed, 104 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 2278797c769d..1fe348884477 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -45,8 +45,33 @@
 #include <linux/kernel.h>
 #include <linux/stringify.h>
 
+#if defined(COMPILER_HAS_X86_SEG_SUPPORT) && IS_ENABLED(CONFIG_SMP)
+#define USE_X86_SEG_SUPPORT	1
+#else
+#define USE_X86_SEG_SUPPORT	0
+#endif
+
 #ifdef CONFIG_SMP
+
+#if USE_X86_SEG_SUPPORT
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override	__seg_gs
+#else
+#define __percpu_seg_override	__seg_fs
+#endif
+
+#define __percpu_prefix		""
+#define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
+
+#else /* USE_X86_SEG_SUPPORT */
+
+#define __percpu_seg_override
 #define __percpu_prefix		"%%"__stringify(__percpu_seg)":"
+#define __force_percpu_prefix	__percpu_prefix
+
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
 /*
@@ -58,14 +83,21 @@
 	unsigned long tcp_ptr__;			\
 	asm volatile("add " __percpu_arg(1) ", %0"	\
 		     : "=r" (tcp_ptr__)			\
-		     : "m" (this_cpu_off), "0" (ptr));	\
+		     : "m" (__my_cpu_var(this_cpu_off)),\
+		       "0" (ptr));			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
 })
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
 #define __percpu_prefix		""
-#endif
+#define __force_percpu_prefix	""
+#endif /* CONFIG_SMP */
 
+#define __my_cpu_type(var)	typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var)	(*__my_cpu_ptr(&var))
 #define __percpu_arg(x)		__percpu_prefix "%" #x
+#define __force_percpu_arg(x)	__force_percpu_prefix "%" #x
 
 /*
  * Initialized pointers to per-cpu variables needed for the boot
@@ -98,22 +130,22 @@ do {							\
 	switch (sizeof(var)) {				\
 	case 1:						\
 		asm qual (op "b %1,"__percpu_arg(0)	\
-		    : "+m" (var)			\
+		    : "+m" (__my_cpu_var(var))		\
 		    : "qi" ((pto_T__)(val)));		\
 		break;					\
 	case 2:						\
 		asm qual (op "w %1,"__percpu_arg(0)	\
-		    : "+m" (var)			\
+		    : "+m" (__my_cpu_var(var))		\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 4:						\
 		asm qual (op "l %1,"__percpu_arg(0)	\
-		    : "+m" (var)			\
+		    : "+m" (__my_cpu_var(var))		\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 8:						\
 		asm qual (op "q %1,"__percpu_arg(0)	\
-		    : "+m" (var)			\
+		    : "+m" (__my_cpu_var(var))		\
 		    : "re" ((pto_T__)(val)));		\
 		break;					\
 	default: __bad_percpu_size();			\
@@ -138,71 +170,79 @@ do {									\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		if (pao_ID__ == 1)					\
-			asm qual ("incb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incb "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else if (pao_ID__ == -1)				\
-			asm qual ("decb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decb "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else							\
 			asm qual ("addb %1, "__percpu_arg(0)		\
-			    : "+m" (var)				\
+			    : "+m" (__my_cpu_var(var))			\
 			    : "qi" ((pao_T__)(val)));			\
 		break;							\
 	case 2:								\
 		if (pao_ID__ == 1)					\
-			asm qual ("incw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incw "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else if (pao_ID__ == -1)				\
-			asm qual ("decw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decw "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else							\
 			asm qual ("addw %1, "__percpu_arg(0)		\
-			    : "+m" (var)				\
+			    : "+m" (__my_cpu_var(var))			\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 4:								\
 		if (pao_ID__ == 1)					\
-			asm qual ("incl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incl "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else if (pao_ID__ == -1)				\
-			asm qual ("decl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decl "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else							\
 			asm qual ("addl %1, "__percpu_arg(0)		\
-			    : "+m" (var)				\
+			    : "+m" (__my_cpu_var(var))			\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 8:								\
 		if (pao_ID__ == 1)					\
-			asm qual ("incq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incq "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else if (pao_ID__ == -1)				\
-			asm qual ("decq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decq "__percpu_arg(0) :		\
+				  "+m" (__my_cpu_var(var)));		\
 		else							\
 			asm qual ("addq %1, "__percpu_arg(0)		\
-			    : "+m" (var)				\
+			    : "+m" (__my_cpu_var(var))			\
 			    : "re" ((pao_T__)(val)));			\
 		break;							\
 	default: __bad_percpu_size();					\
 	}								\
 } while (0)
 
-#define percpu_from_op(qual, op, var)			\
-({							\
+#define percpu_from_op(qual, op, var)					\
+({									\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
 		asm qual (op "b "__percpu_arg(1)",%0"	\
 		    : "=q" (pfo_ret__)			\
-		    : "m" (var));			\
+		    : "m" (__my_cpu_var(var)));		\
 		break;					\
 	case 2:						\
 		asm qual (op "w "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		    : "m" (__my_cpu_var(var)));		\
 		break;					\
 	case 4:						\
 		asm qual (op "l "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		    : "m" (__my_cpu_var(var)));		\
 		break;					\
 	case 8:						\
 		asm qual (op "q "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		    : "m" (__my_cpu_var(var)));		\
 		break;					\
 	default: __bad_percpu_size();			\
 	}						\
@@ -214,22 +254,22 @@ do {									\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(P1)",%0"	\
+		asm(op "b "__force_percpu_arg(P1)",%0"	\
 		    : "=q" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(P1)",%0"	\
+		asm(op "w "__force_percpu_arg(P1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(P1)",%0"	\
+		asm(op "l "__force_percpu_arg(P1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(P1)",%0"	\
+		asm(op "q "__force_percpu_arg(P1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
@@ -243,19 +283,19 @@ do {									\
 	switch (sizeof(var)) {				\
 	case 1:						\
 		asm qual (op "b "__percpu_arg(0)	\
-		    : "+m" (var));			\
+		    : "+m" (__my_cpu_var(var)));	\
 		break;					\
 	case 2:						\
 		asm qual (op "w "__percpu_arg(0)	\
-		    : "+m" (var));			\
+		    : "+m" (__my_cpu_var(var)));	\
 		break;					\
 	case 4:						\
 		asm qual (op "l "__percpu_arg(0)	\
-		    : "+m" (var));			\
+		    : "+m" (__my_cpu_var(var)));	\
 		break;					\
 	case 8:						\
 		asm qual (op "q "__percpu_arg(0)	\
-		    : "+m" (var));			\
+		    : "+m" (__my_cpu_var(var)));	\
 		break;					\
 	default: __bad_percpu_size();			\
 	}						\
@@ -270,22 +310,26 @@ do {									\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		asm qual ("xaddb %0, "__percpu_arg(1)			\
-			    : "+q" (paro_ret__), "+m" (var)		\
+			    : "+q" (paro_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : : "memory");				\
 		break;							\
 	case 2:								\
 		asm qual ("xaddw %0, "__percpu_arg(1)			\
-			    : "+r" (paro_ret__), "+m" (var)		\
+			    : "+r" (paro_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : : "memory");				\
 		break;							\
 	case 4:								\
 		asm qual ("xaddl %0, "__percpu_arg(1)			\
-			    : "+r" (paro_ret__), "+m" (var)		\
+			    : "+r" (paro_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : : "memory");				\
 		break;							\
 	case 8:								\
 		asm qual ("xaddq %0, "__percpu_arg(1)			\
-			    : "+re" (paro_ret__), "+m" (var)		\
+			    : "+re" (paro_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : : "memory");				\
 		break;							\
 	default: __bad_percpu_size();					\
@@ -308,7 +352,8 @@ do {									\
 		asm qual ("\n\tmov "__percpu_arg(1)",%%al"		\
 		    "\n1:\tcmpxchgb %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
-			    : "=&a" (pxo_ret__), "+m" (var)		\
+			    : "=&a" (pxo_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : "q" (pxo_new__)				\
 			    : "memory");				\
 		break;							\
@@ -316,7 +361,8 @@ do {									\
 		asm qual ("\n\tmov "__percpu_arg(1)",%%ax"		\
 		    "\n1:\tcmpxchgw %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
-			    : "=&a" (pxo_ret__), "+m" (var)		\
+			    : "=&a" (pxo_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : "r" (pxo_new__)				\
 			    : "memory");				\
 		break;							\
@@ -324,7 +370,8 @@ do {									\
 		asm qual ("\n\tmov "__percpu_arg(1)",%%eax"		\
 		    "\n1:\tcmpxchgl %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
-			    : "=&a" (pxo_ret__), "+m" (var)		\
+			    : "=&a" (pxo_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : "r" (pxo_new__)				\
 			    : "memory");				\
 		break;							\
@@ -332,7 +379,8 @@ do {									\
 		asm qual ("\n\tmov "__percpu_arg(1)",%%rax"		\
 		    "\n1:\tcmpxchgq %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
-			    : "=&a" (pxo_ret__), "+m" (var)		\
+			    : "=&a" (pxo_ret__),			\
+			      "+m" (__my_cpu_var(var))			\
 			    : "r" (pxo_new__)				\
 			    : "memory");				\
 		break;							\
@@ -353,25 +401,25 @@ do {									\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		asm qual ("cmpxchgb %2, "__percpu_arg(1)		\
-			    : "=a" (pco_ret__), "+m" (var)		\
+			    : "=a" (pco_ret__), "+m" (__my_cpu_var(var))\
 			    : "q" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 2:								\
 		asm qual ("cmpxchgw %2, "__percpu_arg(1)		\
-			    : "=a" (pco_ret__), "+m" (var)		\
+			    : "=a" (pco_ret__), "+m" (__my_cpu_var(var))\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 4:								\
 		asm qual ("cmpxchgl %2, "__percpu_arg(1)		\
-			    : "=a" (pco_ret__), "+m" (var)		\
+			    : "=a" (pco_ret__), "+m" (__my_cpu_var(var))\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 8:								\
 		asm qual ("cmpxchgq %2, "__percpu_arg(1)		\
-			    : "=a" (pco_ret__), "+m" (var)		\
+			    : "=a" (pco_ret__), "+m" (__my_cpu_var(var))\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
@@ -464,7 +512,8 @@ do {									\
 	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
 	asm volatile("cmpxchg8b "__percpu_arg(1)			\
 		     CC_SET(z)						\
-		     : CC_OUT(z) (__ret), "+m" (pcp1), "+m" (pcp2), "+a" (__o1), "+d" (__o2) \
+		     : CC_OUT(z) (__ret), "+m" (__my_cpu_var(pcp1)),	\
+		    "+m" (__my_cpu_var(pcp2)), "+a" (__o1), "+d" (__o2) \
 		     : "b" (__n1), "c" (__n2));				\
 	__ret;								\
 })
@@ -507,11 +556,13 @@ do {									\
 	bool __ret;							\
 	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
 	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
-	alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
-		       "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t",	\
+	alternative_io("leaq %P[ptr],%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
+		       "cmpxchg16b " __force_percpu_arg([ptr]) "\n\tsetz %0\n\t",\
 		       X86_FEATURE_CX16,				\
-		       ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),		\
-				   "+m" (pcp2), "+d" (__o2)),		\
+		       ASM_OUTPUT2("=a" (__ret),			\
+				   "+m" (__my_cpu_var(pcp1)),		\
+				   "+m" (__my_cpu_var(pcp2)),		\
+				   "+d" (__o2)), [ptr]"m" (pcp1),	\
 		       "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");	\
 	__ret;								\
 })
@@ -542,7 +593,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr,
 	asm volatile("btl "__percpu_arg(2)",%1"
 			CC_SET(c)
 			: CC_OUT(c) (oldbit)
-			: "m" (*(unsigned long __percpu *)addr), "Ir" (nr));
+			: "m" (*__my_cpu_ptr((unsigned long __percpu *)(addr))), "Ir" (nr));
 
 	return oldbit;
 }
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 3d4cb83a8828..8d46efbc675a 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -91,7 +91,8 @@ static __always_inline void __preempt_count_sub(int val)
  */
 static __always_inline bool __preempt_count_dec_and_test(void)
 {
-	return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var]));
+	return GEN_UNARY_RMWcc("decl", __my_cpu_var(__preempt_count), e,
+			       __percpu_arg([var]));
 }
 
 /*
-- 
2.17.1



* [PATCH 3/7] x86/percpu: Use C for percpu accesses when possible
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
  2019-08-23 22:44 ` [PATCH 1/7] compiler: Report x86 segment support Nadav Amit
  2019-08-23 22:44 ` [PATCH 2/7] x86/percpu: Use compiler segment prefix qualifier Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 4/7] x86: Fix possible caching of current_task Nadav Amit
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

The percpu code mostly uses inline assembly. Using segment qualifiers
allows C code to be used instead, which enables the compiler to perform
various optimizations (e.g., CSE). For example, in __schedule() the
following two instructions:

  mov    %gs:0x7e5f1eff(%rip),%edx        # 0x10350 <cpu_number>
  movslq %edx,%rdx

turn with this patch into:

  movslq %gs:0x7e5f2e6e(%rip),%rax        # 0x10350 <cpu_number>

In addition, operations that have no guarantee against concurrent
interrupts or preemption, such as __this_cpu_cmpxchg(), can be further
optimized by the compiler when they are implemented in C, for example
in call_timer_fn(). A small sketch of the effect follows.
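
A rough user-space sketch of the kind of folding the compiler can now do
(the %gs offset is made up and is not the kernel's cpu_number):

  static inline int gs_read_int(unsigned long off)
  {
          return *(const int __seg_gs *)off;      /* mov %gs:off,%eax */
  }

  static int cpu_number_twice(void)
  {
          /* Two reads of the same slot with no intervening store: GCC is
           * free to fold them into a single %gs-relative load (CSE),
           * which it cannot do across opaque inline-assembly reads.
           */
          return gs_read_int(0x40) + gs_read_int(0x40);
  }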

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/percpu.h | 115 +++++++++++++++++++++++++++++++---
 1 file changed, 105 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1fe348884477..13987f9bc82f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -439,13 +439,88 @@ do {									\
  */
 #define this_cpu_read_stable(var)	percpu_stable_op("mov", var)
 
+#if USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp)					\
+({									\
+	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
+})
+
+#define __raw_cpu_write(qual, pcp, val)					\
+	do {								\
+		*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+	} while (0)
+
+/*
+ * Performance-wise, C operations are only more efficient than their inline
+ * assembly counterparts for non-volatile variables (__this_*) and for volatile
+ * loads and stores.
+ *
+ * Since we do not use assembly, we are free to define 64-bit operations
+ * on 32-bit architecture
+ */
+#define __raw_cpu_add(pcp, val)		do { __my_cpu_var(pcp) += (val); } while (0)
+#define __raw_cpu_and(pcp, val)		do { __my_cpu_var(pcp) &= (val); } while (0)
+#define __raw_cpu_or(pcp, val)		do { __my_cpu_var(pcp) |= (val); } while (0)
+#define __raw_cpu_add_return(pcp, val)	({ __my_cpu_var(pcp) += (val); })
+
+#define __raw_cpu_xchg(pcp, val)					\
+({									\
+	typeof(pcp) pxo_ret__ = __my_cpu_var(pcp);			\
+									\
+	__my_cpu_var(pcp) = (val);					\
+	pxo_ret__;							\
+})
+
+#define __raw_cpu_cmpxchg(pcp, oval, nval)				\
+({									\
+	__my_cpu_type(pcp) *__p = __my_cpu_ptr(&(pcp));			\
+									\
+	typeof(pcp) __ret = *__p;					\
+									\
+	if (__ret == (oval))						\
+		*__p = nval;						\
+	__ret;								\
+})
+
+#define raw_cpu_read_1(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_add_1(pcp, val)		__raw_cpu_add(pcp, val)
+#define raw_cpu_add_2(pcp, val)		__raw_cpu_add(pcp, val)
+#define raw_cpu_add_4(pcp, val)		__raw_cpu_add(pcp, val)
+#define raw_cpu_and_1(pcp, val)		__raw_cpu_and(pcp, val)
+#define raw_cpu_and_2(pcp, val)		__raw_cpu_and(pcp, val)
+#define raw_cpu_and_4(pcp, val)		__raw_cpu_and(pcp, val)
+#define raw_cpu_or_1(pcp, val)		__raw_cpu_or(pcp, val)
+#define raw_cpu_or_2(pcp, val)		__raw_cpu_or(pcp, val)
+#define raw_cpu_or_4(pcp, val)		__raw_cpu_or(pcp, val)
+#define raw_cpu_xchg_1(pcp, val)	__raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_2(pcp, val)	__raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_4(pcp, val)	__raw_cpu_xchg(pcp, val)
+#define raw_cpu_add_return_1(pcp, val)	__raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_2(pcp, val)	__raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_4(pcp, val)	__raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_8(pcp, val)		__raw_cpu_add_return(pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	__raw_cpu_cmpxchg(pcp, oval, nval)
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	__raw_cpu_cmpxchg(pcp, oval, nval)
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	__raw_cpu_cmpxchg(pcp, oval, nval)
+
+#define this_cpu_read_1(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+
+#else
 #define raw_cpu_read_1(pcp)		percpu_from_op(, "mov", pcp)
 #define raw_cpu_read_2(pcp)		percpu_from_op(, "mov", pcp)
 #define raw_cpu_read_4(pcp)		percpu_from_op(, "mov", pcp)
 
-#define raw_cpu_write_1(pcp, val)	percpu_to_op(, "mov", (pcp), val)
-#define raw_cpu_write_2(pcp, val)	percpu_to_op(, "mov", (pcp), val)
-#define raw_cpu_write_4(pcp, val)	percpu_to_op(, "mov", (pcp), val)
 #define raw_cpu_add_1(pcp, val)		percpu_add_op(, (pcp), val)
 #define raw_cpu_add_2(pcp, val)		percpu_add_op(, (pcp), val)
 #define raw_cpu_add_4(pcp, val)		percpu_add_op(, (pcp), val)
@@ -477,6 +552,14 @@ do {									\
 #define this_cpu_write_1(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
 #define this_cpu_write_2(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
 #define this_cpu_write_4(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+
+#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#endif
 #define this_cpu_add_1(pcp, val)	percpu_add_op(volatile, (pcp), val)
 #define this_cpu_add_2(pcp, val)	percpu_add_op(volatile, (pcp), val)
 #define this_cpu_add_4(pcp, val)	percpu_add_op(volatile, (pcp), val)
@@ -490,13 +573,6 @@ do {									\
 #define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
 #define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
 
-#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(, pcp, val)
-#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(, pcp, val)
-#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(, pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
-
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(volatile, pcp, val)
@@ -527,6 +603,22 @@ do {									\
  * 32 bit must fall back to generic operations.
  */
 #ifdef CONFIG_X86_64
+
+#if USE_X86_SEG_SUPPORT
+
+#define raw_cpu_read_8(pcp)			__raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val)		__raw_cpu_write(, pcp, val)
+#define raw_cpu_add_8(pcp, val)			__raw_cpu_add(pcp, val)
+#define raw_cpu_and_8(pcp, val)			__raw_cpu_and(pcp, val)
+#define raw_cpu_or_8(pcp, val)			__raw_cpu_or(pcp, val)
+#define raw_cpu_xchg_8(pcp, nval)		__raw_cpu_xchg(pcp, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	__raw_cpu_cmpxchg(pcp, oval, nval)
+
+#define this_cpu_read_8(pcp)			__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val)		__raw_cpu_write(volatile, pcp, val)
+
+#else
+
 #define raw_cpu_read_8(pcp)			percpu_from_op(, "mov", pcp)
 #define raw_cpu_write_8(pcp, val)		percpu_to_op(, "mov", (pcp), val)
 #define raw_cpu_add_8(pcp, val)			percpu_add_op(, (pcp), val)
@@ -538,6 +630,9 @@ do {									\
 
 #define this_cpu_read_8(pcp)			percpu_from_op(volatile, "mov", pcp)
 #define this_cpu_write_8(pcp, val)		percpu_to_op(volatile, "mov", (pcp), val)
+
+#endif
+
 #define this_cpu_add_8(pcp, val)		percpu_add_op(volatile, (pcp), val)
 #define this_cpu_and_8(pcp, val)		percpu_to_op(volatile, "and", (pcp), val)
 #define this_cpu_or_8(pcp, val)			percpu_to_op(volatile, "or", (pcp), val)
-- 
2.17.1



* [PATCH 4/7] x86: Fix possible caching of current_task
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
                   ` (2 preceding siblings ...)
  2019-08-23 22:44 ` [PATCH 3/7] x86/percpu: Use C for percpu accesses when possible Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 5/7] percpu: Assume preemption is disabled on per_cpu_ptr() Nadav Amit
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

this_cpu_read_stable() is allowed, and indeed expected, to cache and
return the same value, specifically for current_task. In practice it does
not cache current_task very well, and this weak caching is the only thing
that prevents possibly invalid caching when the task is switched in
__switch_to().

Avoid the possibly invalid caching by not using current within
__switch_to()'s dynamic extent.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/fpu/internal.h    |  7 ++++---
 arch/x86/include/asm/resctrl_sched.h   | 14 +++++++-------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 arch/x86/kernel/process_32.c           |  4 ++--
 arch/x86/kernel/process_64.c           |  4 ++--
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 4c95c365058a..b537788600fe 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -588,9 +588,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
- * complete FPU state until the return to userland.
+ * complete FPU state until the return to userland. Avoid reading current during
+ * switch.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct task_struct *task, struct fpu *new_fpu)
 {
 	u32 pkru_val = init_pkru_value;
 	struct pkru_state *pk;
@@ -598,7 +599,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	if (!static_cpu_has(X86_FEATURE_FPU))
 		return;
 
-	set_thread_flag(TIF_NEED_FPU_LOAD);
+	set_ti_thread_flag(task_thread_info(task), TIF_NEED_FPU_LOAD);
 
 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
 		return;
diff --git a/arch/x86/include/asm/resctrl_sched.h b/arch/x86/include/asm/resctrl_sched.h
index f6b7fe2833cc..9a00d9df9d02 100644
--- a/arch/x86/include/asm/resctrl_sched.h
+++ b/arch/x86/include/asm/resctrl_sched.h
@@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  *   simple as possible.
  * Must be called with preemption disabled.
  */
-static void __resctrl_sched_in(void)
+static void __resctrl_sched_in(struct task_struct *task)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
 	u32 closid = state->default_closid;
@@ -62,13 +62,13 @@ static void __resctrl_sched_in(void)
 	 * Else use the closid/rmid assigned to this cpu.
 	 */
 	if (static_branch_likely(&rdt_alloc_enable_key)) {
-		if (current->closid)
+		if (task->closid)
 			closid = current->closid;
 	}
 
 	if (static_branch_likely(&rdt_mon_enable_key)) {
-		if (current->rmid)
-			rmid = current->rmid;
+		if (task->rmid)
+			rmid = task->rmid;
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
@@ -78,15 +78,15 @@ static void __resctrl_sched_in(void)
 	}
 }
 
-static inline void resctrl_sched_in(void)
+static inline void resctrl_sched_in(struct task_struct *task)
 {
 	if (static_branch_likely(&rdt_enable_key))
-		__resctrl_sched_in();
+		__resctrl_sched_in(task);
 }
 
 #else
 
-static inline void resctrl_sched_in(void) {}
+static inline void resctrl_sched_in(struct task_struct *task) {}
 
 #endif /* CONFIG_X86_CPU_RESCTRL */
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a46dee8e78db..5fcf56cbf438 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -311,7 +311,7 @@ static void update_cpu_closid_rmid(void *info)
 	 * executing task might have its own closid selected. Just reuse
 	 * the context switch code.
 	 */
-	resctrl_sched_in();
+	resctrl_sched_in(current);
 }
 
 /*
@@ -536,7 +536,7 @@ static void move_myself(struct callback_head *head)
 
 	preempt_disable();
 	/* update PQR_ASSOC MSR to make resource group go into effect */
-	resctrl_sched_in();
+	resctrl_sched_in(current);
 	preempt_enable();
 
 	kfree(callback);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..699a4c95ab13 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,10 +292,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	this_cpu_write(current_task, next_p);
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(next_p, next_fpu);
 
 	/* Load the Intel cache allocation PQR MSR. */
-	resctrl_sched_in();
+	resctrl_sched_in(next_p);
 
 	return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index af64519b2695..bb811808936d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -565,7 +565,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(current_task, next_p);
 	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(next_p, next_fpu);
 
 	/* Reload sp0. */
 	update_task_stack(next_p);
@@ -612,7 +612,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	}
 
 	/* Load the Intel cache allocation PQR MSR. */
-	resctrl_sched_in();
+	resctrl_sched_in(next_p);
 
 	return prev_p;
 }
-- 
2.17.1



* [PATCH 5/7] percpu: Assume preemption is disabled on per_cpu_ptr()
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
                   ` (3 preceding siblings ...)
  2019-08-23 22:44 ` [PATCH 4/7] x86: Fix possible caching of current_task Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 6/7] x86/percpu: Optimized arch_raw_cpu_ptr() Nadav Amit
  2019-08-23 22:44 ` [PATCH 7/7] x86/current: Aggressive caching of current Nadav Amit
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

When per_cpu_ptr() is used, the caller should have preemption disabled,
as otherwise the pointer is meaningless. If the user wants an "unstable"
pointer, they should call raw_cpu_ptr() instead.

Add an assertion to check that preemption is indeed disabled, and
distinguish between the two cases to allow further per-arch
optimizations. A short usage sketch follows.
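
A usage sketch of the distinction being enforced (my_counter and
bump_counter() are hypothetical, not part of this patch):

  #include <linux/percpu.h>
  #include <linux/preempt.h>

  static DEFINE_PER_CPU(int, my_counter);

  static void bump_counter(void)
  {
          int *p;

          preempt_disable();
          p = this_cpu_ptr(&my_counter);  /* preemption disabled: stable */
          (*p)++;
          preempt_enable();

          /* Explicitly "unstable": the CPU may change underneath us,
           * which is acceptable e.g. for best-effort statistics.
           */
          p = raw_cpu_ptr(&my_counter);
          (*p)++;
  }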

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/asm-generic/percpu.h | 12 ++++++++++++
 include/linux/percpu-defs.h  | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index c2de013b2cf4..7853605f4210 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -36,6 +36,14 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define my_cpu_offset __my_cpu_offset
 #endif
 
+/*
+ * Determine the offset of the current active processor when preemption is
+ * disabled. Can be overridden by arch code.
+ */
+#ifndef __raw_my_cpu_offset
+#define __raw_my_cpu_offset __my_cpu_offset
+#endif
+
 /*
  * Arch may define arch_raw_cpu_ptr() to provide more efficient address
  * translations for raw_cpu_ptr().
@@ -44,6 +52,10 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
 #endif
 
+#ifndef arch_raw_cpu_ptr_preemptable
+#define arch_raw_cpu_ptr_preemptable(ptr) SHIFT_PERCPU_PTR(ptr, __raw_my_cpu_offset)
+#endif
+
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
 #endif
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index a6fabd865211..13afca8a37e7 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -237,20 +237,51 @@ do {									\
 	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));			\
 })
 
+#ifndef arch_raw_cpu_ptr_preemption_disabled
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)			\
+	arch_raw_cpu_ptr(ptr)
+#endif
+
+#define raw_cpu_ptr_preemption_disabled(ptr)				\
+({									\
+	__verify_pcpu_ptr(ptr);						\
+	arch_raw_cpu_ptr_preemption_disabled(ptr);			\
+})
+
+/*
+ * If preemption is enabled, we need to read the pointer atomically on
+ * raw_cpu_ptr(). However, if preemption is disabled, we can use
+ * raw_cpu_ptr_preemption_disabled(), which is potentially more efficient.
+ * Similarly, the preemption-disabled version can be used if the kernel is
+ * non-preemptible or if voluntary preemption is used.
+ */
+#ifdef CONFIG_PREEMPT
+
 #define raw_cpu_ptr(ptr)						\
 ({									\
 	__verify_pcpu_ptr(ptr);						\
 	arch_raw_cpu_ptr(ptr);						\
 })
 
+#else
+
+#define raw_cpu_ptr(ptr)	raw_cpu_ptr_preemption_disabled(ptr)
+
+#endif
+
 #ifdef CONFIG_DEBUG_PREEMPT
+/*
+ * Unlike other this_cpu_* operations, this_cpu_ptr() requires that preemption
+ * will be disabled. In contrast, raw_cpu_ptr() does not require that.
+ */
 #define this_cpu_ptr(ptr)						\
 ({									\
+	__this_cpu_preempt_check("ptr");				\
 	__verify_pcpu_ptr(ptr);						\
 	SHIFT_PERCPU_PTR(ptr, my_cpu_offset);				\
 })
 #else
-#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
+#define this_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr)
 #endif
 
 #else	/* CONFIG_SMP */
-- 
2.17.1



* [PATCH 6/7] x86/percpu: Optimized arch_raw_cpu_ptr()
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
                   ` (4 preceding siblings ...)
  2019-08-23 22:44 ` [PATCH 5/7] percpu: Assume preemption is disabled on per_cpu_ptr() Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  2019-08-23 22:44 ` [PATCH 7/7] x86/current: Aggressive caching of current Nadav Amit
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as setting an appropriate base register to
compute the address instead of emitting a separate add instruction.

This computation is only beneficial when compiler segment qualifiers are
used. The method cannot be used when the address size is greater than the
maximum operand size, as is the case when building vdso32.

Distinguish between the case in which preemption is disabled (as happens
when this_cpu_ptr() is used) and the case in which it is enabled (as when
raw_cpu_ptr() is used).

This allows further optimizations; for instance, in rcu_dynticks_eqs_exit()
the following code:

  mov    $0x2bbc0,%rax
  add    %gs:0x7ef07570(%rip),%rax	# 0x10358 <this_cpu_off>
  lock xadd %edx,0xd8(%rax)

turns with this patch into:

  mov    %gs:0x7ef08aa5(%rip),%rax	# 0x10358 <this_cpu_off>
  lock xadd %edx,0x2bc58(%rax)
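
For reference, a reduced user-space sketch of the same folding effect (the
offsets below are made up; this is not the kernel macro):

  /* Reading the per-CPU base through a %gs-qualified pointer and then
   * addressing a field at a constant offset lets the compiler fold the
   * offset into the addressing mode of the final access, e.g.:
   *   mov  %gs:0x10358,%rax
   *   addl $1,0x2bc58(%rax)
   */
  static inline void inc_field(void)
  {
          char *base = *(char * __seg_gs *)0x10358;

          ++*(int *)(base + 0x2bc58);
  }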

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/percpu.h | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13987f9bc82f..3d82df27d45c 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -73,20 +73,41 @@
 #endif /* USE_X86_SEG_SUPPORT */
 
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
+#define __raw_my_cpu_offset	__this_cpu_read(this_cpu_off)
+#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
 
+#if USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64)
+/*
+ * Efficient implementation for cases in which the compiler supports C segments.
+ * Allows the compiler to perform additional optimizations that can save more
+ * instructions.
+ *
+ * This optimized version can only be used if the pointer size equals the
+ * native operand size, which is not the case when building vdso32.
+ */
+#define __arch_raw_cpu_ptr_qual(qual, ptr)				\
+({									\
+	(qual typeof(*(ptr)) __kernel __force *)((uintptr_t)(ptr) +	\
+						__my_cpu_offset);	\
+})
+#else /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64) */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
-#define arch_raw_cpu_ptr(ptr)				\
+#define __arch_raw_cpu_ptr_qual(qual, ptr)		\
 ({							\
 	unsigned long tcp_ptr__;			\
-	asm volatile("add " __percpu_arg(1) ", %0"	\
+	asm qual ("add " __percpu_arg(1) ", %0"		\
 		     : "=r" (tcp_ptr__)			\
 		     : "m" (__my_cpu_var(this_cpu_off)),\
 		       "0" (ptr));			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
 })
+#endif /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64) */
+
+#define arch_raw_cpu_ptr(ptr)				__arch_raw_cpu_ptr_qual(volatile, ptr)
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)	__arch_raw_cpu_ptr_qual( , ptr)
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
-- 
2.17.1



* [PATCH 7/7] x86/current: Aggressive caching of current
  2019-08-23 22:44 [PATCH 0/7] x86/percpu: Use segment qualifiers Nadav Amit
                   ` (5 preceding siblings ...)
  2019-08-23 22:44 ` [PATCH 6/7] x86/percpu: Optimized arch_raw_cpu_ptr() Nadav Amit
@ 2019-08-23 22:44 ` Nadav Amit
  6 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-08-23 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Thomas Gleixner, Thomas Garnier, Ingo Molnar,
	Nadav Amit

current_task is supposed to be constant within each thread and therefore
does not need to be reread. There is already an attempt to cache it
using inline assembly, via this_cpu_read_stable(), which hides the
dependency on the read memory address from the compiler.

However, this caching does not work very well. For example,
sync_mm_rss() still reads current_task twice for no reason.

Allow more aggressive caching by aliasing current_task to a constant,
const_current_task, and reading from the constant copy (a short sketch of
the aliasing trick follows). Doing so requires the compiler to support
x86 segment qualifiers. Hide const_current_task in a different
compilation unit to prevent the compiler from assuming that the value is
constant during compilation.
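
A minimal sketch of the "alias to const" trick in isolation (made-up names,
outside the per-cpu machinery):

  /* the real, writable object */
  int real_var = 42;

  /* A read-only alias to the same storage: code that only sees the const
   * declaration may cache the value freely, while real_var can still be
   * updated elsewhere (e.g. from another translation unit).
   */
  extern const int const_var __attribute__((alias("real_var")));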

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/current.h | 30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/Makefile   |  1 +
 arch/x86/kernel/cpu/common.c   |  7 +------
 arch/x86/kernel/cpu/current.c  | 16 ++++++++++++++++
 include/linux/compiler.h       |  2 +-
 5 files changed, 49 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 3e204e6140b5..7f093e81a647 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -10,11 +10,41 @@ struct task_struct;
 
 DECLARE_PER_CPU(struct task_struct *, current_task);
 
+#if USE_X86_SEG_SUPPORT
+
+/*
+ * Hold a constant alias for current_task, which would allow to avoid caching of
+ * current task.
+ *
+ * We must mark const_current_task with the segment qualifiers, as otherwise gcc
+ * would do redundant reads of const_current_task.
+ */
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task);
+
+static __always_inline struct task_struct *get_current(void)
+{
+
+	/*
+	 * GCC lacks the ability to remove segment qualifiers, which interferes
+	 * with the per-cpu infrastructure that creates local copies. Use
+	 * __raw_cpu_read() to avoid creating any copy.
+	 */
+	return __raw_cpu_read(, const_current_task);
+}
+
+#else /* USE_X86_SEG_SUPPORT */
+
+/*
+ * Without segment qualifier support, the per-cpu infrastructure is not
+ * suitable for reading constants, so use this_cpu_read_stable() in this case.
+ */
 static __always_inline struct task_struct *get_current(void)
 {
 	return this_cpu_read_stable(current_task);
 }
 
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define current get_current()
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index d7a1e5a9331c..d816f03a37d7 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -19,6 +19,7 @@ CFLAGS_common.o		:= $(nostackp)
 
 obj-y			:= cacheinfo.o scattered.o topology.o
 obj-y			+= common.o
+obj-y			+= current.o
 obj-y			+= rdrand.o
 obj-y			+= match.o
 obj-y			+= bugs.o
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5cc2d51cc25e..5f7c9ee57802 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1619,13 +1619,8 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
 /*
- * The following percpu variables are hot.  Align current_task to
- * cacheline size such that they fall in the same cacheline.
+ * The following percpu variables are hot.
  */
-DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
-	&init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
-
 DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
 DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 
diff --git a/arch/x86/kernel/cpu/current.c b/arch/x86/kernel/cpu/current.c
new file mode 100644
index 000000000000..3238c6e34984
--- /dev/null
+++ b/arch/x86/kernel/cpu/current.c
@@ -0,0 +1,16 @@
+#include <linux/sched/task.h>
+#include <asm/current.h>
+
+/*
+ * Align current_task to cacheline size such that they fall in the same
+ * cacheline.
+ */
+DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
+	&init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
+#if USE_X86_SEG_SUPPORT
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task)
+	__attribute__((alias("current_task")));
+EXPORT_PER_CPU_SYMBOL(const_current_task);
+#endif
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index f0fd5636fddb..1b6ee9ab6373 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,7 +299,7 @@ unsigned long read_word_at_a_time(const void *addr)
  */
 #define __ADDRESSABLE(sym) \
 	static void * __section(".discard.addressable") __used \
-		__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+		__PASTE(__addressable_##sym, __LINE__) = (void *)(uintptr_t)&sym;
 
 /**
  * offset_to_ptr - convert a relative memory offset to an absolute pointer
-- 
2.17.1


