* [PATCH 0/5] x86/percpu semantics and fixes
@ 2019-02-27 10:12 Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
` (7 more replies)
0 siblings, 8 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel
Hi all,
This is a collection of x86/percpu changes that I had pending and got reminded
of by Linus' comment yesterday about __this_cpu_xchg().
This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
Built and boot tested with CONFIG_DEBUG_PREEMPT=y.
---
arch/x86/include/asm/irq_regs.h | 4 +-
arch/x86/include/asm/percpu.h | 236 +++++++++++++++++++++-------------------
arch/x86/include/asm/smp.h | 3 +-
arch/x86/mm/tlb.c | 62 +++++------
include/linux/smp.h | 45 +++++---
kernel/sched/fair.c | 5 +-
6 files changed, 193 insertions(+), 162 deletions(-)
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
2019-02-27 16:14 ` Linus Torvalds
2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
` (6 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit
Nadav Amit reported that commit:
b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
added a bunch of constraints to all sorts of code; and while some of
that was correct and desired, some of that seems superfluous.
The thing is, the this_cpu_*() operations are defined to be IRQ-safe,
which means the values are subject to change from IRQs and thus must be
reloaded.
Also (the generic form):
local_irq_save()
__this_cpu_read()
local_irq_restore()
would not allow the reuse of previous values; if by nothing else, then
because of the barrier()s implied by local_irq_*().
Which raises the point that percpu_from_op() and the others also need
that volatile qualifier.
OTOH __this_cpu_*() operations are not IRQ-safe and assume external
preempt/IRQ disabling and could thus be allowed more room for
optimization.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/include/asm/percpu.h | 224 +++++++++++++++++++++---------------------
1 file changed, 112 insertions(+), 112 deletions(-)
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -87,7 +87,7 @@
* don't give an lvalue though). */
extern void __bad_percpu_size(void);
-#define percpu_to_op(op, var, val) \
+#define percpu_to_op(qual, op, var, val) \
do { \
typedef typeof(var) pto_T__; \
if (0) { \
@@ -97,22 +97,22 @@ do { \
} \
switch (sizeof(var)) { \
case 1: \
- asm(op "b %1,"__percpu_arg(0) \
+ asm qual (op "b %1,"__percpu_arg(0) \
: "+m" (var) \
: "qi" ((pto_T__)(val))); \
break; \
case 2: \
- asm(op "w %1,"__percpu_arg(0) \
+ asm qual (op "w %1,"__percpu_arg(0) \
: "+m" (var) \
: "ri" ((pto_T__)(val))); \
break; \
case 4: \
- asm(op "l %1,"__percpu_arg(0) \
+ asm qual (op "l %1,"__percpu_arg(0) \
: "+m" (var) \
: "ri" ((pto_T__)(val))); \
break; \
case 8: \
- asm(op "q %1,"__percpu_arg(0) \
+ asm qual (op "q %1,"__percpu_arg(0) \
: "+m" (var) \
: "re" ((pto_T__)(val))); \
break; \
@@ -124,7 +124,7 @@ do { \
* Generate a percpu add to memory instruction and optimize code
* if one is added or subtracted.
*/
-#define percpu_add_op(var, val) \
+#define percpu_add_op(qual, var, val) \
do { \
typedef typeof(var) pao_T__; \
const int pao_ID__ = (__builtin_constant_p(val) && \
@@ -138,41 +138,41 @@ do { \
switch (sizeof(var)) { \
case 1: \
if (pao_ID__ == 1) \
- asm("incb "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("incb "__percpu_arg(0) : "+m" (var)); \
else if (pao_ID__ == -1) \
- asm("decb "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("decb "__percpu_arg(0) : "+m" (var)); \
else \
- asm("addb %1, "__percpu_arg(0) \
+ asm qual ("addb %1, "__percpu_arg(0) \
: "+m" (var) \
: "qi" ((pao_T__)(val))); \
break; \
case 2: \
if (pao_ID__ == 1) \
- asm("incw "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("incw "__percpu_arg(0) : "+m" (var)); \
else if (pao_ID__ == -1) \
- asm("decw "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("decw "__percpu_arg(0) : "+m" (var)); \
else \
- asm("addw %1, "__percpu_arg(0) \
+ asm qual ("addw %1, "__percpu_arg(0) \
: "+m" (var) \
: "ri" ((pao_T__)(val))); \
break; \
case 4: \
if (pao_ID__ == 1) \
- asm("incl "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("incl "__percpu_arg(0) : "+m" (var)); \
else if (pao_ID__ == -1) \
- asm("decl "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("decl "__percpu_arg(0) : "+m" (var)); \
else \
- asm("addl %1, "__percpu_arg(0) \
+ asm qual ("addl %1, "__percpu_arg(0) \
: "+m" (var) \
: "ri" ((pao_T__)(val))); \
break; \
case 8: \
if (pao_ID__ == 1) \
- asm("incq "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("incq "__percpu_arg(0) : "+m" (var)); \
else if (pao_ID__ == -1) \
- asm("decq "__percpu_arg(0) : "+m" (var)); \
+ asm qual ("decq "__percpu_arg(0) : "+m" (var)); \
else \
- asm("addq %1, "__percpu_arg(0) \
+ asm qual ("addq %1, "__percpu_arg(0) \
: "+m" (var) \
: "re" ((pao_T__)(val))); \
break; \
@@ -180,27 +180,27 @@ do { \
} \
} while (0)
-#define percpu_from_op(op, var) \
+#define percpu_from_op(qual, op, var) \
({ \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
case 1: \
- asm volatile(op "b "__percpu_arg(1)",%0"\
+ asm qual (op "b "__percpu_arg(1)",%0" \
: "=q" (pfo_ret__) \
: "m" (var)); \
break; \
case 2: \
- asm volatile(op "w "__percpu_arg(1)",%0"\
+ asm qual (op "w "__percpu_arg(1)",%0" \
: "=r" (pfo_ret__) \
: "m" (var)); \
break; \
case 4: \
- asm volatile(op "l "__percpu_arg(1)",%0"\
+ asm qual (op "l "__percpu_arg(1)",%0" \
: "=r" (pfo_ret__) \
: "m" (var)); \
break; \
case 8: \
- asm volatile(op "q "__percpu_arg(1)",%0"\
+ asm qual (op "q "__percpu_arg(1)",%0" \
: "=r" (pfo_ret__) \
: "m" (var)); \
break; \
@@ -238,23 +238,23 @@ do { \
pfo_ret__; \
})
-#define percpu_unary_op(op, var) \
+#define percpu_unary_op(qual, op, var) \
({ \
switch (sizeof(var)) { \
case 1: \
- asm(op "b "__percpu_arg(0) \
+ asm qual (op "b "__percpu_arg(0) \
: "+m" (var)); \
break; \
case 2: \
- asm(op "w "__percpu_arg(0) \
+ asm qual (op "w "__percpu_arg(0) \
: "+m" (var)); \
break; \
case 4: \
- asm(op "l "__percpu_arg(0) \
+ asm qual (op "l "__percpu_arg(0) \
: "+m" (var)); \
break; \
case 8: \
- asm(op "q "__percpu_arg(0) \
+ asm qual (op "q "__percpu_arg(0) \
: "+m" (var)); \
break; \
default: __bad_percpu_size(); \
@@ -264,27 +264,27 @@ do { \
/*
* Add return operation
*/
-#define percpu_add_return_op(var, val) \
+#define percpu_add_return_op(qual, var, val) \
({ \
typeof(var) paro_ret__ = val; \
switch (sizeof(var)) { \
case 1: \
- asm("xaddb %0, "__percpu_arg(1) \
+ asm qual ("xaddb %0, "__percpu_arg(1) \
: "+q" (paro_ret__), "+m" (var) \
: : "memory"); \
break; \
case 2: \
- asm("xaddw %0, "__percpu_arg(1) \
+ asm qual ("xaddw %0, "__percpu_arg(1) \
: "+r" (paro_ret__), "+m" (var) \
: : "memory"); \
break; \
case 4: \
- asm("xaddl %0, "__percpu_arg(1) \
+ asm qual ("xaddl %0, "__percpu_arg(1) \
: "+r" (paro_ret__), "+m" (var) \
: : "memory"); \
break; \
case 8: \
- asm("xaddq %0, "__percpu_arg(1) \
+ asm qual ("xaddq %0, "__percpu_arg(1) \
: "+re" (paro_ret__), "+m" (var) \
: : "memory"); \
break; \
@@ -299,13 +299,13 @@ do { \
* expensive due to the implied lock prefix. The processor cannot prefetch
* cachelines if xchg is used.
*/
-#define percpu_xchg_op(var, nval) \
+#define percpu_xchg_op(qual, var, nval) \
({ \
typeof(var) pxo_ret__; \
typeof(var) pxo_new__ = (nval); \
switch (sizeof(var)) { \
case 1: \
- asm("\n\tmov "__percpu_arg(1)",%%al" \
+ asm qual ("\n\tmov "__percpu_arg(1)",%%al" \
"\n1:\tcmpxchgb %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=&a" (pxo_ret__), "+m" (var) \
@@ -313,7 +313,7 @@ do { \
: "memory"); \
break; \
case 2: \
- asm("\n\tmov "__percpu_arg(1)",%%ax" \
+ asm qual ("\n\tmov "__percpu_arg(1)",%%ax" \
"\n1:\tcmpxchgw %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=&a" (pxo_ret__), "+m" (var) \
@@ -321,7 +321,7 @@ do { \
: "memory"); \
break; \
case 4: \
- asm("\n\tmov "__percpu_arg(1)",%%eax" \
+ asm qual ("\n\tmov "__percpu_arg(1)",%%eax" \
"\n1:\tcmpxchgl %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=&a" (pxo_ret__), "+m" (var) \
@@ -329,7 +329,7 @@ do { \
: "memory"); \
break; \
case 8: \
- asm("\n\tmov "__percpu_arg(1)",%%rax" \
+ asm qual ("\n\tmov "__percpu_arg(1)",%%rax" \
"\n1:\tcmpxchgq %2, "__percpu_arg(1) \
"\n\tjnz 1b" \
: "=&a" (pxo_ret__), "+m" (var) \
@@ -345,32 +345,32 @@ do { \
* cmpxchg has no such implied lock semantics as a result it is much
* more efficient for cpu local operations.
*/
-#define percpu_cmpxchg_op(var, oval, nval) \
+#define percpu_cmpxchg_op(qual, var, oval, nval) \
({ \
typeof(var) pco_ret__; \
typeof(var) pco_old__ = (oval); \
typeof(var) pco_new__ = (nval); \
switch (sizeof(var)) { \
case 1: \
- asm("cmpxchgb %2, "__percpu_arg(1) \
+ asm qual ("cmpxchgb %2, "__percpu_arg(1) \
: "=a" (pco_ret__), "+m" (var) \
: "q" (pco_new__), "0" (pco_old__) \
: "memory"); \
break; \
case 2: \
- asm("cmpxchgw %2, "__percpu_arg(1) \
+ asm qual ("cmpxchgw %2, "__percpu_arg(1) \
: "=a" (pco_ret__), "+m" (var) \
: "r" (pco_new__), "0" (pco_old__) \
: "memory"); \
break; \
case 4: \
- asm("cmpxchgl %2, "__percpu_arg(1) \
+ asm qual ("cmpxchgl %2, "__percpu_arg(1) \
: "=a" (pco_ret__), "+m" (var) \
: "r" (pco_new__), "0" (pco_old__) \
: "memory"); \
break; \
case 8: \
- asm("cmpxchgq %2, "__percpu_arg(1) \
+ asm qual ("cmpxchgq %2, "__percpu_arg(1) \
: "=a" (pco_ret__), "+m" (var) \
: "r" (pco_new__), "0" (pco_old__) \
: "memory"); \
@@ -391,58 +391,58 @@ do { \
*/
#define this_cpu_read_stable(var) percpu_stable_op("mov", var)
-#define raw_cpu_read_1(pcp) percpu_from_op("mov", pcp)
-#define raw_cpu_read_2(pcp) percpu_from_op("mov", pcp)
-#define raw_cpu_read_4(pcp) percpu_from_op("mov", pcp)
-
-#define raw_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_1(pcp, val) percpu_add_op((pcp), val)
-#define raw_cpu_add_2(pcp, val) percpu_add_op((pcp), val)
-#define raw_cpu_add_4(pcp, val) percpu_add_op((pcp), val)
-#define raw_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(pcp, val)
-
-#define this_cpu_read_1(pcp) percpu_from_op("mov", pcp)
-#define this_cpu_read_2(pcp) percpu_from_op("mov", pcp)
-#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)
-#define this_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_1(pcp, val) percpu_add_op((pcp), val)
-#define this_cpu_add_2(pcp, val) percpu_add_op((pcp), val)
-#define this_cpu_add_4(pcp, val) percpu_add_op((pcp), val)
-#define this_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val)
-#define this_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val)
-#define this_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val)
-#define this_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val)
-#define this_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val)
-#define this_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val)
-#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval)
-
-#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_1(pcp) percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_2(pcp) percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_4(pcp) percpu_from_op(, "mov", pcp)
+
+#define raw_cpu_write_1(pcp, val) percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_2(pcp, val) percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_4(pcp, val) percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_1(pcp, val) percpu_add_op(, (pcp), val)
+#define raw_cpu_add_2(pcp, val) percpu_add_op(, (pcp), val)
+#define raw_cpu_add_4(pcp, val) percpu_add_op(, (pcp), val)
+#define raw_cpu_and_1(pcp, val) percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_2(pcp, val) percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_4(pcp, val) percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val)
+
+#define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_4(pcp) percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val) percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val) percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val) percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_1(pcp, val) percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_2(pcp, val) percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_4(pcp, val) percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_1(pcp, val) percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_2(pcp, val) percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_4(pcp, val) percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_1(pcp, val) percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_2(pcp, val) percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_4(pcp, val) percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(volatile, pcp, nval)
+
+#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(, pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval)
#ifdef CONFIG_X86_CMPXCHG64
#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2) \
@@ -466,23 +466,23 @@ do { \
* 32 bit must fall back to generic operations.
*/
#ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp) percpu_from_op("mov", pcp)
-#define raw_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_8(pcp, val) percpu_add_op((pcp), val)
-#define raw_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val)
-#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val)
-#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_read_8(pcp) percpu_from_op("mov", pcp)
-#define this_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_8(pcp, val) percpu_add_op((pcp), val)
-#define this_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val)
-#define this_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val)
-#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_8(pcp) percpu_from_op(, "mov", pcp)
+#define raw_cpu_write_8(pcp, val) percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_8(pcp, val) percpu_add_op(, (pcp), val)
+#define raw_cpu_and_8(pcp, val) percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_8(pcp, val) percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(, pcp, val)
+#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(, pcp, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_read_8(pcp) percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val) percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val) percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val) percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val) percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval)
/*
* Pretty complex macro to generate cmpxchg16 instruction. The instruction
* [PATCH 2/5] x86/percpu: Relax smp_processor_id()
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
` (5 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit
Nadav reported that since this_cpu_read() became asm-volatile, many
smp_processor_id() users generated worse code due to the extra
constraints.
However since smp_processor_id() is reading a stable value, we can use
__this_cpu_read().
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/include/asm/smp.h | 3 ++-
include/linux/smp.h | 45 +++++++++++++++++++++++++++++++--------------
2 files changed, 33 insertions(+), 15 deletions(-)
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -162,7 +162,8 @@ __visible void smp_call_function_single_
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define raw_smp_processor_id() (this_cpu_read(cpu_number))
+#define raw_smp_processor_id() this_cpu_read(cpu_number)
+#define __smp_processor_id() __this_cpu_read(cpu_number)
#ifdef CONFIG_X86_32
extern int safe_smp_processor_id(void);
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -181,29 +181,46 @@ static inline int get_boot_cpu_id(void)
#endif /* !SMP */
-/*
- * smp_processor_id(): get the current CPU ID.
+/**
+ * raw_smp_processor_id() - get the current (unstable) CPU id
+ *
+ * For when you know what you are doing and need an unstable
+ * CPU id.
+ */
+
+/**
+ * smp_processor_id() - get the current (stable) CPU id
+ *
+ * This is the normal accessor to the CPU id and should be used
+ * whenever possible.
*
- * if DEBUG_PREEMPT is enabled then we check whether it is
- * used in a preemption-safe way. (smp_processor_id() is safe
- * if it's used in a preemption-off critical section, or in
- * a thread that is bound to the current CPU.)
+ * The CPU id is stable when:
+ *
+ * - IRQs are disabled;
+ * - preemption is disabled;
+ * - the task is CPU affine.
*
- * NOTE: raw_smp_processor_id() is for internal use only
- * (smp_processor_id() is the preferred variant), but in rare
- * instances it might also be used to turn off false positives
- * (i.e. smp_processor_id() use that the debugging code reports but
- * which use for some reason is legal). Don't use this to hack around
- * the warning message, as your code might not work under PREEMPT.
+ * When CONFIG_DEBUG_PREEMPT is enabled, we verify these assumptions and
+ * WARN when smp_processor_id() is used while the CPU id is not stable.
*/
+
+/*
+ * Allow the architecture to differentiate between a stable and unstable read.
+ * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
+ * regular asm read for the stable.
+ */
+#ifndef __smp_processor_id
+#define __smp_processor_id(x) raw_smp_processor_id(x)
+#endif
+
#ifdef CONFIG_DEBUG_PREEMPT
extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
-# define smp_processor_id() raw_smp_processor_id()
+# define smp_processor_id() __smp_processor_id()
#endif
-#define get_cpu() ({ preempt_disable(); smp_processor_id(); })
+#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()
/*
* [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs()
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
` (4 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit
Nadav reported that since the this_cpu_*() ops gained asm-volatile
constraints, code generation suffered for do_IRQ(); but since this all
runs with IRQs disabled, we can use __this_cpu_*().
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/include/asm/irq_regs.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/include/asm/irq_regs.h
+++ b/arch/x86/include/asm/irq_regs.h
@@ -16,7 +16,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_re
static inline struct pt_regs *get_irq_regs(void)
{
- return this_cpu_read(irq_regs);
+ return __this_cpu_read(irq_regs);
}
static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
@@ -24,7 +24,7 @@ static inline struct pt_regs *set_irq_re
struct pt_regs *old_regs;
old_regs = get_irq_regs();
- this_cpu_write(irq_regs, new_regs);
+ __this_cpu_write(irq_regs, new_regs);
return old_regs;
}
* [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
` (2 preceding siblings ...)
2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
` (3 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit
Almost all of this runs with IRQs disabled and therefore doesn't need
the extra constraints on the this_cpu_*() ops; use __this_cpu_*() to
alleviate this.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/mm/tlb.c | 62 +++++++++++++++++++++++++++---------------------------
1 file changed, 31 insertions(+), 31 deletions(-)
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -58,15 +58,15 @@ static void clear_asid_other(void)
for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
/* Do not need to flush the current asid */
- if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+ if (asid == __this_cpu_read(cpu_tlbstate.loaded_mm_asid))
continue;
/*
* Make sure the next time we go to switch to
* this asid, we do a flush:
*/
- this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+ __this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
}
- this_cpu_write(cpu_tlbstate.invalidate_other, false);
+ __this_cpu_write(cpu_tlbstate.invalidate_other, false);
}
atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
@@ -83,16 +83,16 @@ static void choose_new_asid(struct mm_st
return;
}
- if (this_cpu_read(cpu_tlbstate.invalidate_other))
+ if (__this_cpu_read(cpu_tlbstate.invalidate_other))
clear_asid_other();
for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
- if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+ if (__this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
next->context.ctx_id)
continue;
*new_asid = asid;
- *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+ *need_flush = (__this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
next_tlb_gen);
return;
}
@@ -101,10 +101,10 @@ static void choose_new_asid(struct mm_st
* We don't currently own an ASID slot on this CPU.
* Allocate a slot.
*/
- *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+ *new_asid = __this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
if (*new_asid >= TLB_NR_DYN_ASIDS) {
*new_asid = 0;
- this_cpu_write(cpu_tlbstate.next_asid, 1);
+ __this_cpu_write(cpu_tlbstate.next_asid, 1);
}
*need_flush = true;
}
@@ -245,7 +245,7 @@ static void cond_ibpb(struct task_struct
* cpu_tlbstate.last_user_mm_ibpb for comparison.
*/
next_mm = mm_mangle_tif_spec_ib(next);
- prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
+ prev_mm = __this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
/*
* Issue IBPB only if the mm's are different and one or
@@ -255,7 +255,7 @@ static void cond_ibpb(struct task_struct
(next_mm | prev_mm) & LAST_USER_MM_IBPB)
indirect_branch_prediction_barrier();
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
+ __this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
}
if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -264,9 +264,9 @@ static void cond_ibpb(struct task_struct
* different context than the user space task which ran
* last on this CPU.
*/
- if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+ if (__this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
indirect_branch_prediction_barrier();
- this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
+ __this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
}
}
}
@@ -274,9 +274,9 @@ static void cond_ibpb(struct task_struct
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
- struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
- u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+ struct mm_struct *real_prev = __this_cpu_read(cpu_tlbstate.loaded_mm);
+ u16 prev_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ bool was_lazy = __this_cpu_read(cpu_tlbstate.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct
__flush_tlb_all();
}
#endif
- this_cpu_write(cpu_tlbstate.is_lazy, false);
+ __this_cpu_write(cpu_tlbstate.is_lazy, false);
/*
* The membarrier system call requires a full memory barrier and
@@ -330,7 +330,7 @@ void switch_mm_irqs_off(struct mm_struct
* memory barrier and core serializing instruction.
*/
if (real_prev == next) {
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+ VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);
/*
@@ -358,7 +358,7 @@ void switch_mm_irqs_off(struct mm_struct
*/
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+ if (__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
next_tlb_gen)
return;
@@ -406,13 +406,13 @@ void switch_mm_irqs_off(struct mm_struct
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
/* Let nmi_uaccess_okay() know that we're changing CR3. */
- this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+ __this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
barrier();
}
if (need_flush) {
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ __this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ __this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
load_new_mm_cr3(next->pgd, new_asid, true);
/*
@@ -435,8 +435,8 @@ void switch_mm_irqs_off(struct mm_struct
/* Make sure we write CR3 before loaded_mm. */
barrier();
- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+ __this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ __this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
if (next != real_prev) {
load_mm_cr4(next);
@@ -529,10 +529,10 @@ static void flush_tlb_func_common(const
* - f->new_tlb_gen: the generation that the requester of the flush
* wants us to catch up to.
*/
- struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
- u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ struct mm_struct *loaded_mm = __this_cpu_read(cpu_tlbstate.loaded_mm);
+ u32 loaded_mm_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ u64 local_tlb_gen = __this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
@@ -540,10 +540,10 @@ static void flush_tlb_func_common(const
if (unlikely(loaded_mm == &init_mm))
return;
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
+ VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);
- if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+ if (__this_cpu_read(cpu_tlbstate.is_lazy)) {
/*
* We're in lazy mode. We need to at least flush our
* paging-structure cache to avoid speculatively reading
@@ -631,7 +631,7 @@ static void flush_tlb_func_common(const
}
/* Both paths above update our state to mm_tlb_gen. */
- this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
+ __this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
}
static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -647,7 +647,7 @@ static void flush_tlb_func_remote(void *
inc_irq_stat(irq_tlb_count);
- if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+ if (f->mm && f->mm != __this_cpu_read(cpu_tlbstate.loaded_mm))
return;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -749,7 +749,7 @@ void flush_tlb_mm_range(struct mm_struct
info.end = TLB_FLUSH_ALL;
}
- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+ if (mm == __this_cpu_read(cpu_tlbstate.loaded_mm)) {
VM_WARN_ON(irqs_disabled());
local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock()
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
` (3 preceding siblings ...)
2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
` (2 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit
Nadav reported that code-gen changed because of the this_cpu_*()
constraints; avoid this for select_idle_cpu(), because it runs with
preemption (and IRQs) disabled anyway.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6160,6 +6160,7 @@ static int select_idle_cpu(struct task_s
u64 time, cost;
s64 delta;
int cpu, nr = INT_MAX;
+ int this = smp_processor_id();
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6183,7 +6184,7 @@ static int select_idle_cpu(struct task_s
nr = 4;
}
- time = local_clock();
+ time = cpu_clock(this);
for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
if (!--nr)
@@ -6194,7 +6195,7 @@ static int select_idle_cpu(struct task_s
break;
}
- time = local_clock() - time;
+ time = cpu_clock(this) - time;
cost = this_sd->avg_scan_cost;
delta = (s64)(time - cost) / 8;
this_sd->avg_scan_cost += delta;
* [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg()
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
` (4 preceding siblings ...)
2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
@ 2019-02-27 10:24 ` Peter Zijlstra
2019-02-27 14:43 ` Peter Zijlstra
2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
2019-03-08 14:50 ` Peter Zijlstra
7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:24 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel
And because it's one of _those_ days, I forgot to include one patch...
---
Subject: x86/percpu: Optimize raw_cpu_xchg()
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Feb 27 11:09:56 CET 2019
Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/include/asm/percpu.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -407,9 +407,21 @@ do { \
#define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val)
#define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val)
#define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val)
+
+/*
+ * raw_cpu_xchg() can use a load-store since it is not required to be
+ * IRQ-safe.
+ */
+#define raw_percpu_xchg_op(var, nval) \
+({ \
+ typeof(var) pxo_ret__ = raw_cpu_read(var); \
+ raw_cpu_write(var, (nval)); \
+ pxo_ret__; \
+})
+
+#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)
#define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp)
#define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp)
* Re: [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg()
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
@ 2019-02-27 14:43 ` Peter Zijlstra
0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 14:43 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel
On Wed, Feb 27, 2019 at 11:24:45AM +0100, Peter Zijlstra wrote:
>
> And because it's one of _those_ days, I forgot to include one patch...
>
> ---
> Subject: x86/percpu: Optimize raw_cpu_xchg()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Feb 27 11:09:56 CET 2019
>
> Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
> this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
> loop.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> arch/x86/include/asm/percpu.h | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -407,9 +407,21 @@ do { \
> #define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val)
> #define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val)
> #define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val)
> -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val)
> +
> +/*
> + * raw_cpu_xchg() can use a load-store since it is not required to be
> + * IRQ-safe.
> + */
> +#define raw_percpu_xchg_op(var, nval) \
> +({ \
> + typeof(var) pxo_ret__ = raw_cpu_read(var); \
> + raw_cpu_write(var, (nval)); \
> + pxo_ret__; \
> +})
> +
> +#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)
>
> #define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp)
> #define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp)
And yes, I just added raw_cpu_xchg_8... *sigh*
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
@ 2019-02-27 16:14 ` Linus Torvalds
2019-02-27 16:48 ` Peter Zijlstra
2019-02-27 17:57 ` Nadav Amit
0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 16:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
Nadav Amit, Linux List Kernel Mailing, Nadav Amit
[-- Attachment #1: Type: text/plain, Size: 1078 bytes --]
On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Nadav Amit reported that commit:
>
> b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>
> added a bunch of constraints to all sorts of code; and while some of
> that was correct and desired, some of that seems superfluous.
Hmm.
I have the strong feeling that we should instead relax this_cpu_read()
again a bit.
In particular, making it "asm volatile" really is a big hammer
approach. It's worth noting that the *other* this_cpu_xyz ops don't
even do that.
I would suggest that instead of making "this_cpu_read()" be asm
volatile, we mark it as potentially changing the memory location it is
touching - the same way the modify/write ops do.
That still means that the read will be forced (like READ_ONCE()), but
allows gcc a bit more flexibility in instruction scheduling, I think.
Trivial (but entirely untested) patch attached.
That said, I didn't actually check how it affects code generation.
Nadav, would you check the code sequences you originally noticed?
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1348 bytes --]
arch/x86/include/asm/percpu.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1a19d11cfbbd..63b0f361533f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,24 +185,24 @@ do { \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
case 1: \
- asm volatile(op "b "__percpu_arg(1)",%0"\
- : "=q" (pfo_ret__) \
- : "m" (var)); \
+ asm(op "b "__percpu_arg(1)",%0"\
+ : "=q" (pfo_ret__), \
+ "+m" (var)); \
break; \
case 2: \
- asm volatile(op "w "__percpu_arg(1)",%0"\
- : "=r" (pfo_ret__) \
- : "m" (var)); \
+ asm(op "w "__percpu_arg(1)",%0"\
+ : "=r" (pfo_ret__), \
+ "+m" (var)); \
break; \
case 4: \
- asm volatile(op "l "__percpu_arg(1)",%0"\
- : "=r" (pfo_ret__) \
- : "m" (var)); \
+ asm(op "l "__percpu_arg(1)",%0"\
+ : "=r" (pfo_ret__), \
+ "+m" (var)); \
break; \
case 8: \
- asm volatile(op "q "__percpu_arg(1)",%0"\
- : "=r" (pfo_ret__) \
- : "m" (var)); \
+ asm(op "q "__percpu_arg(1)",%0"\
+ : "=r" (pfo_ret__), \
+ "+m" (var)); \
break; \
default: __bad_percpu_size(); \
} \
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 16:14 ` Linus Torvalds
@ 2019-02-27 16:48 ` Peter Zijlstra
2019-02-27 17:17 ` Linus Torvalds
2019-02-27 17:57 ` Nadav Amit
1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 16:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
Nadav Amit, Linux List Kernel Mailing, Nadav Amit
On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Nadav Amit reported that commit:
> >
> > b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
> >
> > added a bunch of constraints to all sorts of code; and while some of
> > that was correct and desired, some of that seems superfluous.
>
> Hmm.
>
> I have the strong feeling that we should instead relax this_cpu_read()
> again a bit.
>
> In particular, making it "asm volatile" really is a big hammer
> approach. It's worth noting that the *other* this_cpu_xyz ops don't
> even do that.
Right, this patch 'fixes' that :-)
> I would suggest that instead of making "this_cpu_read()" be asm
> volatile, we mark it as potentially changing the memory location it is
> touching - the same way the modify/write ops do.
>
> That still means that the read will be forced (like READ_ONCE()), but
> allows gcc a bit more flexibility in instruction scheduling, I think.
Ah, fair enough, I'll spin a version of this patch with "+m" for
this_cpu and "m" for raw_cpu.
> That said, I didn't actually check how it affects code generation.
> Nadav, would you check the code sequences you originally noticed?
Much of it was the ONCE behaviour defeating CSE I think, but yes, it
would be good to have another look.
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 16:48 ` Peter Zijlstra
@ 2019-02-27 17:17 ` Linus Torvalds
2019-02-27 17:34 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 17:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
Nadav Amit, Linux List Kernel Mailing, Nadav Amit
On Wed, Feb 27, 2019 at 8:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote:
> > In particular, making it "asm volatile" really is a big hammer
> > approach. It's worth noting that the *other* this_cpu_xyz ops don't
> > even do that.
>
> Right, this patch 'fixes' that :-)
Yeah, but I do hate seeing these patches that randomly add double
underscore versions just because the regular one is so bad.
So I'd much rather improve the regular this_cpu_read() instead, and
hope that we don't need to have this kind of big difference between
that and the double-underscore version.
I'm not convinced that the "asm volatile" is right on the _other_ ops
either. You added them to cmpxchg and xadd too, and it's not obvious
that they should have them. They have the "memory" clobber to order
them wrt actual memory ops, why are they now "asm volatile"?
So I don't really like this patch that randomly adds volatile to
things, and then removes it from one special case that I don't think
should have had it in the first place.
It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad.
Linus
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 17:17 ` Linus Torvalds
@ 2019-02-27 17:34 ` Peter Zijlstra
2019-02-27 17:38 ` Linus Torvalds
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 17:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
Nadav Amit, Linux List Kernel Mailing, Nadav Amit
On Wed, Feb 27, 2019 at 09:17:42AM -0800, Linus Torvalds wrote:
> It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad.
Ah, all I wanted was the ONCE thing and my inline asm foo sucks. If +m
gets us that, awesome.
But the ONCE thing defeats CSE (on purpose!) so for code-gen that
likes/wants that we then have to use the __this_cpu crud.
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 17:34 ` Peter Zijlstra
@ 2019-02-27 17:38 ` Linus Torvalds
0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 17:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
Nadav Amit, Linux List Kernel Mailing, Nadav Amit
On Wed, Feb 27, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> But the ONCE thing defeats CSE (on purpose!) so for code-gen that
> likes/wants that we then have to use the __this_cpu crud.
Right. But I do think that if you want CSE on a percpu access, you
might want to write it as such (ie load the value once and then re-use
the value).
But I'm not sure exactly which accesses Nadav noticed to be a problem,
so.. I guess I should just look at the other patches, but I found them
rather nasty and ad-hoc too so I just skimmed them with an "eww" ;)
Linus
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 16:14 ` Linus Torvalds
2019-02-27 16:48 ` Peter Zijlstra
@ 2019-02-27 17:57 ` Nadav Amit
2019-02-27 18:55 ` Nadav Amit
2019-02-27 19:41 ` Linus Torvalds
1 sibling, 2 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 17:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox
> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> Nadav Amit reported that commit:
>>
>> b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>>
>> added a bunch of constraints to all sorts of code; and while some of
>> that was correct and desired, some of that seems superfluous.
>
> Trivial (but entirely untested) patch attached.
>
> That said, I didn't actually check how it affects code generation.
> Nadav, would you check the code sequences you originally noticed?
The original issue was raised while I was looking into a dropped patch of
Matthew Wilcox that caused code size increase [1]. As a result I noticed
that Peter’s patch caused big changes to the generated assembly across the
kernel - I did not have a specific scenario that I cared about.
The patch you sent (“+m/-volatile”) does increase the code size by 1728
bytes. Although code size is not the only metric for “code optimization”,
the original patch of Peter (“volatile”) only increased the code size by 201
bytes. Peter’s original change also affected only 72 functions vs the 228
that are impacted by the new patch.
I’ll have a look at some specific function assembly, but overall, the “+m”
approach might prevent even more code optimizations than the “volatile” one.
I’ll send an example or two later.
Regards,
Nadav
[1] https://marc.info/?l=linux-mm&m=154341370216693&w=2
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 17:57 ` Nadav Amit
@ 2019-02-27 18:55 ` Nadav Amit
2019-02-27 19:41 ` Linus Torvalds
1 sibling, 0 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 18:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox
> On Feb 27, 2019, at 9:57 AM, Nadav Amit <namit@vmware.com> wrote:
>
>> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> Nadav Amit reported that commit:
>>>
>>> b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>>>
>>> added a bunch of constraints to all sorts of code; and while some of
>>> that was correct and desired, some of that seems superfluous.
>>
>> Trivial (but entirely untested) patch attached.
>>
>> That said, I didn't actually check how it affects code generation.
>> Nadav, would you check the code sequences you originally noticed?
>
> The original issue was raised while I was looking into a dropped patch of
> Matthew Wilcox that caused code size increase [1]. As a result I noticed
> that Peter’s patch caused big changes to the generated assembly across the
> kernel - I did not have a specific scenario that I cared about.
>
> The patch you sent (“+m/-volatile”) does increase the code size by 1728
> bytes. Although code size is not the only metric for “code optimization”,
> the original patch of Peter (“volatile”) only increased the code size by 201
> bytes. Peter’s original change also affected only 72 functions vs 228 that
> impacted by the new patch.
>
> I’ll have a look at some specific function assembly, but overall, the “+m”
> approach might prevent even more code optimizations than the “volatile” one.
>
> I’ll send an example or two later.
Here is one example:
Dump of assembler code for function event_filter_pid_sched_wakeup_probe_pre:
0xffffffff8117c510 <+0>: push %rbp
0xffffffff8117c511 <+1>: mov %rsp,%rbp
0xffffffff8117c514 <+4>: push %rbx
0xffffffff8117c515 <+5>: mov 0x28(%rdi),%rax
0xffffffff8117c519 <+9>: mov %gs:0x78(%rax),%dl
0xffffffff8117c51d <+13>: test %dl,%dl
0xffffffff8117c51f <+15>: je 0xffffffff8117c535 <event_filter_pid_sched_wakeup_probe_pre+37>
0xffffffff8117c521 <+17>: mov %rdi,%rax
0xffffffff8117c524 <+20>: mov 0x78(%rdi),%rdi
0xffffffff8117c528 <+24>: mov 0x28(%rax),%rbx # REDUNDANT
0xffffffff8117c52c <+28>: callq 0xffffffff81167830 <trace_ignore_this_task>
0xffffffff8117c531 <+33>: mov %al,%gs:0x78(%rbx)
0xffffffff8117c535 <+37>: pop %rbx
0xffffffff8117c536 <+38>: pop %rbp
0xffffffff8117c537 <+39>: retq
The instruction at 0xffffffff8117c528 is redundant, and does not exist
without the recent patch. It seems to be a result of -fno-strict-aliasing,
which, due to the new "memory write" ("+m"), causes the compiler to re-read
the data.
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 17:57 ` Nadav Amit
2019-02-27 18:55 ` Nadav Amit
@ 2019-02-27 19:41 ` Linus Torvalds
2019-03-08 13:35 ` Peter Zijlstra
1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 19:41 UTC (permalink / raw)
To: Nadav Amit
Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox
On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
>
> I’ll have a look at some specific function assembly, but overall, the “+m”
> approach might prevent even more code optimizations than the “volatile” one.
Ok, that being the case, let's forget that patch.
I still wonder about the added volatiles to the xadd/cmpxchg cases,
which already had the "memory" clobber which should make the volatile
immaterial..
Linus
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
` (5 preceding siblings ...)
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
@ 2019-02-27 23:16 ` Nadav Amit
2019-03-08 14:50 ` Peter Zijlstra
7 siblings, 0 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 23:16 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: torvalds, mingo, bp, tglx, luto, linux-kernel
> On Feb 27, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi all,
>
> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
>
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
>
> Built and boot tested with CONFIG_DEBUG_PREEMPT=y.
Overall this series affects 70 functions and shortens the code by 326 bytes.
_local_bh_enable() for example is shortened by 14 bytes (26%).
I must admit that although I pointed out some of these issues before, I am
not sure whether they are really important…
Recently, I tried to see how to make the compiler generate “better code”
from Linux. I sprinkled the “pure” attribute on many common functions (e.g.,
page_rmapping(), sg_next()); sprinkled the const attribute on some others
(e.g., jiffies_to_msecs()); created a const-alias variable so the compiler
would consider kaslr variables and sme_me_mask as constant after
initialization; and so on.
I was then looking at the changed code, and while some functions were
shorter and some longer, many common functions did look “better”. The only
problem was that none of the benchmarks I ran showed any measurable impact.
So perhaps it is a matter of measurement, but as of right now there is
no clean win.
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
2019-02-27 19:41 ` Linus Torvalds
@ 2019-03-08 13:35 ` Peter Zijlstra
0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 13:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nadav Amit, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox
On Wed, Feb 27, 2019 at 11:41:31AM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
> >
> > I’ll have a look at some specific function assembly, but overall, the “+m”
> > approach might prevent even more code optimizations than the “volatile” one.
>
> Ok, that being the case, let's forget that patch.
>
> I still wonder about the added volatiles to the xadd/cmpxchg cases,
> which already had the "memory" clobber which should make the volatile
> immaterial..
That was mostly me being OCD about style consistency; but also notice that
the atomic ops often have volatile even though they have a memory
clobber.
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
` (6 preceding siblings ...)
2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
@ 2019-03-08 14:50 ` Peter Zijlstra
2019-03-08 19:35 ` Nadav Amit
7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 14:50 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 21199 bytes --]
On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
>
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
(Sorry; this is going to have _wide_ output)
OK, so what I did is I built 4 kernels (O=defconfig-build{,1,2,3}) with,
respectively, that many patches of this series applied.
When I look at just the vmlinux size output:
$ size defconfig-build*/vmlinux
text data bss dec hex filename
19540631 5040164 1871944 26452739 193a303 defconfig-build/vmlinux
19540635 5040164 1871944 26452743 193a307 defconfig-build1/vmlinux
19540685 5040164 1871944 26452793 193a339 defconfig-build2/vmlinux
19540685 5040164 1871944 26452793 193a339 defconfig-build3/vmlinux
Things appear to get slightly larger; however, when I look in more
detail using my newly written compare script (attached), I get
things like:
$ ./compare.sh defconfig-build defconfig-build1
arch/x86/mm/fault.o 12850 12818 -32
kernel/power/process.o 3586 3706 +120
kernel/locking/rtmutex.o 1687 1671 -16
kernel/sched/core.o 7127 7181 +54
kernel/time/tick-sched.o 8941 8837 -104
kernel/exit.o 310 385 +75
kernel/softirq.o 1217 1199 -18
kernel/workqueue.o 3240 3288 +48
net/ipv6/tcp_ipv6.o 25434 25345 -89
net/ipv4/tcp_ipv4.o 301 305 +4
total 4768226 4768268 +42
When we look at just tick-sched.o:
$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o
can_stop_idle_tick.isra.14 146 139 -7
we see a totally different number?!
$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o can_stop_idle_tick.isra.14
0000 0000000000000680 <can_stop_idle_tick.isra.14>: | 0000 0000000000000680 <can_stop_idle_tick.isra.14>:
0000 680: 53 push %rbx | 0000 680: 53 push %rbx
0001 681: 89 f8 mov %edi,%eax | 0001 681: 89 f8 mov %edi,%eax
0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id | 0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id
000a 68a: 00 | 000a 68a: 00
0007 687: R_X86_64_PC32 __cpu_online_mask-0x4 | 0007 687: R_X86_64_PC32 __cpu_online_mask-0x4
000b 68b: 0f 92 c3 setb %bl | 000b 68b: 0f 92 c3 setb %bl
000e 68e: 73 67 jae 6f7 <can_stop_idle_tick.isra.14+0x77> \ 000e 68e: 73 48 jae 6d8 <can_stop_idle_tick.isra.14+0x58>
0010 690: 8b 06 mov (%rsi),%eax | 0010 690: 8b 06 mov (%rsi),%eax
0012 692: 85 c0 test %eax,%eax | 0012 692: 85 c0 test %eax,%eax
0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37> | 0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37>
0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax | 0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
001d 69d: 00 00 | 001d 69d: 00 00
001b 69b: R_X86_64_32S current_task | 001b 69b: R_X86_64_32S current_task
001f 69f: 48 8b 00 mov (%rax),%rax | 001f 69f: 48 8b 00 mov (%rax),%rax
0022 6a2: a8 08 test $0x8,%al | 0022 6a2: a8 08 test $0x8,%al
0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37> | 0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37>
0026 6a6: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6ae <can_stop \ 0026 6a6: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6ae <can_stop
002d 6ad: 00 | 002d 6ad: 00
002a 6aa: R_X86_64_PC32 irq_stat-0x4 | 002a 6aa: R_X86_64_PC32 irq_stat-0x4
002e 6ae: 66 85 c0 test %ax,%ax \ 002e 6ae: 66 85 f6 test %si,%si
0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d> | 0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d>
0033 6b3: 89 d8 mov %ebx,%eax | 0033 6b3: 89 d8 mov %ebx,%eax
0035 6b5: 5b pop %rbx | 0035 6b5: 5b pop %rbx
0036 6b6: c3 retq | 0036 6b6: c3 retq
0037 6b7: 31 db xor %ebx,%ebx | 0037 6b7: 31 db xor %ebx,%ebx
0039 6b9: 89 d8 mov %ebx,%eax | 0039 6b9: 89 d8 mov %ebx,%eax
003b 6bb: 5b pop %rbx | 003b 6bb: 5b pop %rbx
003c 6bc: c3 retq | 003c 6bc: c3 retq
003d 6bd: 31 db xor %ebx,%ebx | 003d 6bd: 31 db xor %ebx,%ebx
003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id | 003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id
0041 6c1: R_X86_64_PC32 .bss-0x5 | 0041 6c1: R_X86_64_PC32 .bss-0x5
0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33> | 0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33>
0048 6c8: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6d0 <can_stop \ 0048 6c8: 0f b7 f6 movzwl %si,%esi
004f 6cf: 00 \ 004b 6cb: f7 c6 ff fd 00 00 test $0xfdff,%esi
004c 6cc: R_X86_64_PC32 irq_stat-0x4 \ 0051 6d1: 74 e0 je 6b3 <can_stop_idle_tick.isra.14+0x33>
0050 6d0: a9 ff fd 00 00 test $0xfdff,%eax \ 0053 6d3: e9 00 00 00 00 jmpq 6d8 <can_stop_idle_tick.isra.14+0x58>
0055 6d5: 74 dc je 6b3 <can_stop_idle_tick.isra.14+0x33> \ 0054 6d4: R_X86_64_PC32 .text.unlikely-0x4
0057 6d7: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6df <can_stop \ 0058 6d8: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6de <can_stop_id
005e 6de: 00 \ 005a 6da: R_X86_64_PC32 tick_do_timer_cpu-0x4
005b 6db: R_X86_64_PC32 irq_stat-0x4 \ 005e 6de: 75 0a jne 6ea <can_stop_idle_tick.isra.14+0x6a>
005f 6df: 48 c7 c7 00 00 00 00 mov $0x0,%rdi \ 0060 6e0: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 6ea <can_
0062 6e2: R_X86_64_32S .rodata.str1.8 \ 0067 6e7: ff ff ff
0066 6e6: 0f b7 f6 movzwl %si,%esi \ 0062 6e2: R_X86_64_PC32 tick_do_timer_cpu-0x8
0069 6e9: e8 00 00 00 00 callq 6ee <can_stop_idle_tick.isra.14+0x6e> \ 006a 6ea: 48 c7 02 00 00 00 00 movq $0x0,(%rdx)
006a 6ea: R_X86_64_PLT32 printk-0x4 \ 0071 6f1: eb c0 jmp 6b3 <can_stop_idle_tick.isra.14+0x33>
006e 6ee: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 6f5 <can_stop_id \ 0073 6f3: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
0070 6f0: R_X86_64_PC32 .bss-0x5 \ 007a 6fa: 00 00 00 00
0075 6f5: eb bc jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 007e 6fe: 66 90 xchg %ax,%ax
0077 6f7: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6fd <can_stop_id \ fffffffffffff980
0079 6f9: R_X86_64_PC32 tick_do_timer_cpu-0x4 \ 0000 0000000000000000 <can_stop_idle_tick.isra.14.cold.23>:
007d 6fd: 75 0a jne 709 <can_stop_idle_tick.isra.14+0x89> \ 0000 0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
007f 6ff: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 709 <can_ \ 0003 3: R_X86_64_32S .rodata.str1.8
0086 706: ff ff ff \ 0007 7: e8 00 00 00 00 callq c <can_stop_idle_tick.isra.14.cold.23+0x
0081 701: R_X86_64_PC32 tick_do_timer_cpu-0x8 \ 0008 8: R_X86_64_PLT32 printk-0x4
0089 709: 48 c7 02 00 00 00 00 movq $0x0,(%rdx) \ 000c c: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 13 <can_stop_idl
0090 710: eb a1 jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 000e e: R_X86_64_PC32 .bss-0x5
0092 712: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) \ 0013 13: e9 00 00 00 00 jmpq 18 <__setup_setup_tick_nohz>
0099 719: 00 00 00 00 \ 0014 14: R_X86_64_PC32 .text+0x6af
009d 71d: 0f 1f 00 nopl (%rax) \
And we see that GCC created a .cold. subfunction because the first patch
removed the volatile from __this_cpu_read(), and GCC could thus move the
unlikely path out of line.
Similarly for the second patch, which removes volatile from
smp_processor_id():
$ ./compare.sh defconfig-build1 defconfig-build2
arch/x86/events/amd/ibs.o 667 757 +90
arch/x86/kernel/cpu/mce/core.o 2677 2696 +19
arch/x86/kernel/cpu/mce/therm_throt.o 508 527 +19
arch/x86/kernel/cpu/mtrr/generic.o 9523 9203 -320
arch/x86/kernel/acpi/sleep.o 3152 3088 -64
arch/x86/kernel/nmi.o 338 562 +224
arch/x86/kernel/process.o 1554 1586 +32
arch/x86/kernel/tsc_sync.o 5591 5377 -214
kernel/irq/spurious.o 5835 5771 -64
kernel/irq/cpuhotplug.o 2253 2189 -64
kernel/time/clocksource.o 480 593 +113
total 4768268 4768039 -229
we get smaller total executable sections; and even when there is growth:
$ ./compare.sh defconfig-build1 defconfig-build2 arch/x86/events/amd/ibs.o setup_APIC_ibs
0000 0000000000000420 <setup_APIC_ibs>: | 0000 0000000000000420 <setup_APIC_ibs>:
0000 420: 53 push %rbx | 0000 420: 53 push %rbx
0001 421: b9 3a 10 01 c0 mov $0xc001103a,%ecx | 0001 421: b9 3a 10 01 c0 mov $0xc001103a,%ecx
0006 426: 0f 32 rdmsr | 0006 426: 0f 32 rdmsr
0008 428: 48 c1 e2 20 shl $0x20,%rdx | 0008 428: 48 c1 e2 20 shl $0x20,%rdx
000c 42c: 48 89 d3 mov %rdx,%rbx | 000c 42c: 48 89 d3 mov %rdx,%rbx
000f 42f: 48 09 c3 or %rax,%rbx | 000f 42f: 48 09 c3 or %rax,%rbx
0012 432: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) | 0012 432: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0017 437: f6 c7 01 test $0x1,%bh | 0017 437: f6 c7 01 test $0x1,%bh
001a 43a: 74 2a je 466 <setup_APIC_ibs+0x46> \ 001a 43a: 0f 84 00 00 00 00 je 440 <setup_APIC_ibs+0x20>
001c 43c: 89 df mov %ebx,%edi \ 001c 43c: R_X86_64_PC32 .text.unlikely-0x4
001e 43e: 31 c9 xor %ecx,%ecx \ 0020 440: 89 df mov %ebx,%edi
0020 440: 31 f6 xor %esi,%esi \ 0022 442: 31 c9 xor %ecx,%ecx
0022 442: ba 04 00 00 00 mov $0x4,%edx \ 0024 444: 31 f6 xor %esi,%esi
0027 447: 83 e7 0f and $0xf,%edi \ 0026 446: ba 04 00 00 00 mov $0x4,%edx
002a 44a: e8 00 00 00 00 callq 44f <setup_APIC_ibs+0x2f> \ 002b 44b: 83 e7 0f and $0xf,%edi
002b 44b: R_X86_64_PLT32 setup_APIC_eilvt-0x4 \ 002e 44e: e8 00 00 00 00 callq 453 <setup_APIC_ibs+0x33>
002f 44f: 85 c0 test %eax,%eax \ 002f 44f: R_X86_64_PLT32 setup_APIC_eilvt-0x4
0031 451: 75 13 jne 466 <setup_APIC_ibs+0x46> \ 0033 453: 85 c0 test %eax,%eax
0033 453: 5b pop %rbx \ 0035 455: 0f 85 00 00 00 00 jne 45b <setup_APIC_ibs+0x3b>
0034 454: c3 retq \ 0037 457: R_X86_64_PC32 .text.unlikely-0x4
0035 455: 31 d2 xor %edx,%edx \ 003b 45b: 5b pop %rbx
0037 457: 48 89 de mov %rbx,%rsi \ 003c 45c: c3 retq
003a 45a: bf 3a 10 01 c0 mov $0xc001103a,%edi \ 003d 45d: 31 d2 xor %edx,%edx
003f 45f: e8 00 00 00 00 callq 464 <setup_APIC_ibs+0x44> \ 003f 45f: 48 89 de mov %rbx,%rsi
0040 460: R_X86_64_PLT32 do_trace_read_msr-0x4 \ 0042 462: bf 3a 10 01 c0 mov $0xc001103a,%edi
0044 464: eb d1 jmp 437 <setup_APIC_ibs+0x17> \ 0047 467: e8 00 00 00 00 callq 46c <setup_APIC_ibs+0x4c>
0046 466: 65 8b 35 00 00 00 00 mov %gs:0x0(%rip),%esi # 46d <setup_A \ 0048 468: R_X86_64_PLT32 do_trace_read_msr-0x4
0049 469: R_X86_64_PC32 cpu_number-0x4 \ 004c 46c: eb c9 jmp 437 <setup_APIC_ibs+0x17>
004d 46d: 48 c7 c7 00 00 00 00 mov $0x0,%rdi \ 004e 46e: 66 90 xchg %ax,%ax
0050 470: R_X86_64_32S .rodata.str1.8 \ fffffffffffffbe0
0054 474: 5b pop %rbx \ 0000 0000000000000000 <setup_APIC_ibs.cold.9>:
0055 475: e9 00 00 00 00 jmpq 47a <setup_APIC_ibs+0x5a> \ 0000 0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
0056 476: R_X86_64_PLT32 printk-0x4 \ 0003 3: R_X86_64_32S .rodata.str1.8
005a 47a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) \ 0007 7: 5b pop %rbx
fffffffffffffbe0 \ 0008 8: 65 8b 35 00 00 00 00 mov %gs:0x0(%rip),%esi # f <setup_API
\ 000b b: R_X86_64_PC32 cpu_number-0x4
\ 000f f: e9 00 00 00 00 jmpq 14 <force_ibs_eilvt_setup.cold.10>
\ 0010 10: R_X86_64_PLT32 printk-0x4
\ 0000
That is because of cold subfunction creation, with a reduction in size of
the regular path.
The third build included patches 3 and 4 (because they don't overlap
much), and gave some meagre savings:
$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/kernel/irq.o
do_IRQ 195 187 -8
smp_x86_platform_ipi 234 222 -12
smp_kvm_posted_intr_ipi 74 66 -8
smp_kvm_posted_intr_wakeup_ipi 86 78 -8
smp_kvm_posted_intr_nested_ipi 74 66 -8
$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/mm/tlb.o
flush_tlb_func_common.constprop.13 728 719 -9
switch_mm_irqs_off 1528 1524 -4
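For illustration, the kind of per-symbol comparison shown above can be sketched as follows. This is not the attached compare.sh (which the archive does not reproduce); the parse_nm()/diff_sizes() helper names and the `nm -S -t d` input format are assumptions made for this sketch:

```python
import re

def parse_nm(text):
    """Parse `nm -S -t d` style output (value, size, type, symbol;
    numbers in decimal) into a {symbol: size} dict."""
    sizes = {}
    for line in text.splitlines():
        m = re.match(r'\s*\d+\s+(\d+)\s+\S\s+(\S+)', line)
        if m:
            sizes[m.group(2)] = int(m.group(1))
    return sizes

def diff_sizes(old, new):
    """Return (rows, total): one (name, old, new, delta) row per symbol
    whose size changed between builds, plus the total size delta."""
    rows = []
    for sym in sorted(set(old) | set(new)):
        a, b = old.get(sym, 0), new.get(sym, 0)
        if a != b:
            rows.append((sym, a, b, b - a))
    total = sum(new.values()) - sum(old.values())
    return rows, total
```

Fed two `nm` dumps of the same object file from different builds, this yields the same kind of "symbol old new delta" rows as above.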
Now, I realize you particularly hate the tlb patch, and I'll see if I
can get these same savings with fewer __ prefixes added.
But in general, I think these patches are worth it, especially since
I've already done them :-)
[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 1750 bytes --]
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-03-08 14:50 ` Peter Zijlstra
@ 2019-03-08 19:35 ` Nadav Amit
2019-03-08 20:56 ` Peter Zijlstra
2019-03-08 22:48 ` Peter Zijlstra
0 siblings, 2 replies; 23+ messages in thread
From: Nadav Amit @ 2019-03-08 19:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx, luto, linux-kernel
> On Mar 8, 2019, at 6:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
>
>> This is a collection of x86/percpu changes that I had pending and got reminded
>> of by Linus' comment yesterday about __this_cpu_xchg().
>>
>> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
>
> (Sorry; this is going to have _wide_ output)
>
> OK, so what I did is I build 4 kernels (O=defconfig-build{,1,2,3}) with
> resp that many patches of this series applied.
>
> When I look at just the vmlinux size output:
>
> $ size defconfig-build*/vmlinux
> text data bss dec hex filename
> 19540631 5040164 1871944 26452739 193a303 defconfig-build/vmlinux
> 19540635 5040164 1871944 26452743 193a307 defconfig-build1/vmlinux
> 19540685 5040164 1871944 26452793 193a339 defconfig-build2/vmlinux
> 19540685 5040164 1871944 26452793 193a339 defconfig-build3/vmlinux
>
> Things appear to get slightly larger; however, when I look in more
> detail using my newly written compare script (find attached), I get
> things like:
Nice script! I keep asking myself how comparing two binaries can provide
some “number” to indicate how “good” a binary is (at least relative to
another one) - either during compilation or after. Code size, as you show,
is the wrong metric.
Anyhow, I am a little disappointed (and surprised) that in most cases I
played with, this kind of optimization has marginal impact on performance
at best, even when the binary changes “a lot” and when microbenchmarks are
used.
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-03-08 19:35 ` Nadav Amit
@ 2019-03-08 20:56 ` Peter Zijlstra
2019-03-10 12:46 ` Peter Zijlstra
2019-03-08 22:48 ` Peter Zijlstra
1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 20:56 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx, luto, linux-kernel
On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script! I keep asking myself how comparing two binaries can provide
> some “number” to indicate how “good” a binary is (at least relative to
> another one) - either during compilation or after. Code size, as you show,
> is the wrong metric.
Right; I'm still pondering other metrics, like:
readelf -WS | grep AX | grep -v -e init -e exit -e altinstr -e unlikely -e fixup
which is only 'fast' path text.
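Scripted out, that metric could look roughly like the sketch below. The fast_path_text() name and the exact `readelf -WS` column parsing are assumptions; the exclusion list follows the grep above, summing only the sizes of allocatable+executable (AX) sections that survive it:

```python
import re

EXCLUDE = ('init', 'exit', 'altinstr', 'unlikely', 'fixup')

def fast_path_text(readelf_ws_output):
    """Sum the sizes of AX sections from `readelf -WS` output,
    skipping init/exit/altinstr/unlikely/fixup sections."""
    total = 0
    for line in readelf_ws_output.splitlines():
        # readelf -WS rows: [Nr] Name Type Address Off Size ES Flg ...
        m = re.match(r'\s*\[\s*\d+\]\s+(\S+)\s+\S+\s+\S+\s+\S+\s+'
                     r'([0-9a-f]+)\s+\S+\s+(\S+)', line)
        if not m:
            continue
        name, size, flags = m.group(1), int(m.group(2), 16), m.group(3)
        if 'A' in flags and 'X' in flags and \
           not any(x in name for x in EXCLUDE):
            total += size
    return total
```

Run over `readelf -WS vmlinux.o`, this gives a single number for the 'fast' path text that two builds can be compared on.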
> Anyhow, I am a little disappointed (and surprised) that in most cases I
> played with, this kind of optimization has marginal impact on performance
> at best, even when the binary changes “a lot” and when microbenchmarks are
> used.
Right, but if we don't care, it'll be death by 1000 cuts.
Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?
I'm thinking of a variant of Linus' patch, but I'm confused about the
above.
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-03-08 19:35 ` Nadav Amit
2019-03-08 20:56 ` Peter Zijlstra
@ 2019-03-08 22:48 ` Peter Zijlstra
1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 22:48 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx, luto, linux-kernel
On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script!
Find attached a new one; this one is fast enough to run a symbol diff on vmlinux.o.
[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 2221 bytes --]
* Re: [PATCH 0/5] x86/percpu semantics and fixes
2019-03-08 20:56 ` Peter Zijlstra
@ 2019-03-10 12:46 ` Peter Zijlstra
0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-10 12:46 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx, luto, linux-kernel
On Fri, Mar 08, 2019 at 09:56:37PM +0100, Peter Zijlstra wrote:
> Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?
>
> I'm thinking of a variant of Linus' patch, but I'm confused about the
> above.
So whatever I tried with +m only made things worse and always affected
thousands of symbols.
Now, AFAICT the whole percpu_stable_op thing is an ugly hack because some
earlier compiler would not CSE the regular percpu_from_op. But since it
does do that today, especially after my first patch, I tried implementing
this_cpu_read_stable() with percpu_from_op() (no volatile, obviously).
That also affects _lots_ of sites, but also significantly shrinks the
kernel image.
It's 2307 symbols affected, but:
17642871 2157438 747808 20548117 1398a15 defconfig-build1/vmlinux.o (patch 1)
17639081 2157438 747808 20544327 1397b47 defconfig-build0/vmlinux.o (patch 1 - percpu_stable_op)
So I think I'll add a patch removing percpu_stable_op and all its users.
Thread overview: 23+ messages
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
2019-02-27 16:14 ` Linus Torvalds
2019-02-27 16:48 ` Peter Zijlstra
2019-02-27 17:17 ` Linus Torvalds
2019-02-27 17:34 ` Peter Zijlstra
2019-02-27 17:38 ` Linus Torvalds
2019-02-27 17:57 ` Nadav Amit
2019-02-27 18:55 ` Nadav Amit
2019-02-27 19:41 ` Linus Torvalds
2019-03-08 13:35 ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
2019-02-27 14:43 ` Peter Zijlstra
2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
2019-03-08 14:50 ` Peter Zijlstra
2019-03-08 19:35 ` Nadav Amit
2019-03-08 20:56 ` Peter Zijlstra
2019-03-10 12:46 ` Peter Zijlstra
2019-03-08 22:48 ` Peter Zijlstra