* [PATCH 0/2] x86: Remove ideal_nops[] @ 2021-03-12 11:32 Peter Zijlstra 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra ` (3 more replies) 0 siblings, 4 replies; 51+ messages in thread From: Peter Zijlstra @ 2021-03-12 11:32 UTC (permalink / raw) To: x86, rostedt, hpa, torvalds Cc: linux-kernel, linux-toolchains, peterz, jpoimboe, alexei.starovoitov, mhiramat Hi! A while ago Steve complained about x86 being weird for having different NOPs [1] Having cursed the same thing before, I figured it was time to look at the NOP situation. 32bit simply isn't a performance target anymore, so all we need is a set of NOPs that works on all. x86_64 has two main NOP variants, NOPL and prefix NOP. NOPL was introduced by P6 and is architecturally mandated for x86_64. However, some uarchs made the choice to limit NOPL decoding to a single port, which obviously limits NOPL throughput. Other uarchs have (severe) decoding penalties for excessive (>~3) prefixes, hobbling prefix NOP throughput. But the thing is, all the modern uarchs can handle both without issue; that is AMD K10 (2007) and later and Intel Ivy Bridge (2012) and later. The only exception is Atom, which has the prefix penalty. Since ultimate performance of a 10 year old chip (Intel Sandy Bridge, 2011) is simply irrelevant today, remove variable NOPs and use NOPL. This gives us deterministic NOPs and restores sanity. [1] https://lkml.kernel.org/r/20210302105827.3403656c@gandalf.local.home ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 1/2] x86: Remove dynamic NOP selection 2021-03-12 11:32 [PATCH 0/2] x86: Remove ideal_nops[] Peter Zijlstra @ 2021-03-12 11:32 ` Peter Zijlstra 2021-03-12 12:09 ` Peter Zijlstra 2024-01-20 6:58 ` Thorsten Glaser 2021-03-12 11:32 ` [PATCH 2/2] objtool,x86: Use asm/nops.h Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 2 replies; 51+ messages in thread From: Peter Zijlstra @ 2021-03-12 11:32 UTC (permalink / raw) To: x86, rostedt, hpa, torvalds Cc: linux-kernel, linux-toolchains, peterz, jpoimboe, alexei.starovoitov, mhiramat This ensures that a NOP is a NOP and not a random other instruction that is also a NOP. It allows simplification of dynamic code patching that wants to verify existing code before writing new instructions (ftrace, jump_label, static_call, etc..). Differentiating on NOPs is not a feature. This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS). 32bit is not a performance target. Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL being required for x86_64, all x86_64 CPUs can use NOPL. So stop caring about NOPs, simplify things and get on with life. [ The problem seems to be that some uarchs can only decode NOPL on a single front-end port while others have severe decode penalties for excessive prefixes. All modern uarchs can handle both, except Atom, which has prefix penalties. ] [ Also, much doubt you can actually measure any of this on normal workloads. ] After this FEATURE_NOPL is unused except for required-features for x86_64. FEATURE_K8 is only used for PTI and FEATURE_K7 is unused. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf --- arch/x86/include/asm/jump_label.h | 12 -- arch/x86/include/asm/nops.h | 180 ++++++++++--------------------- arch/x86/include/asm/special_insns.h | 4 arch/x86/kernel/alternative.c | 198 +++-------------------------------- arch/x86/kernel/ftrace.c | 4 arch/x86/kernel/jump_label.c | 32 +---- arch/x86/kernel/kprobes/core.c | 2 arch/x86/kernel/setup.c | 1 arch/x86/kernel/static_call.c | 4 arch/x86/net/bpf_jit_comp.c | 8 - 10 files changed, 98 insertions(+), 347 deletions(-) --- a/arch/x86/include/asm/jump_label.h +++ b/arch/x86/include/asm/jump_label.h @@ -6,12 +6,6 @@ #define JUMP_LABEL_NOP_SIZE 5 -#ifdef CONFIG_X86_64 -# define STATIC_KEY_INIT_NOP P6_NOP5_ATOMIC -#else -# define STATIC_KEY_INIT_NOP GENERIC_NOP5_ATOMIC -#endif - #include <asm/asm.h> #include <asm/nops.h> @@ -23,7 +17,7 @@ static __always_inline bool arch_static_branch(struct static_key * const key, const bool branch) { asm_volatile_goto("1:" - ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t" + ".byte " __stringify(BYTES_NOP5) "\n\t" ".pushsection __jump_table, \"aw\" \n\t" _ASM_ALIGN "\n\t" ".long 1b - ., %l[l_yes] - . \n\t" @@ -63,7 +57,7 @@ static __always_inline bool arch_static_ .long \target - .Lstatic_jump_after_\@ .Lstatic_jump_after_\@: .else - .byte STATIC_KEY_INIT_NOP + .byte BYTES_NOP5 .endif .pushsection __jump_table, "aw" _ASM_ALIGN @@ -75,7 +69,7 @@ static __always_inline bool arch_static_ .macro STATIC_JUMP_IF_FALSE target, key, def .Lstatic_jump_\@: .if \def - .byte STATIC_KEY_INIT_NOP + .byte BYTES_NOP5 .else /* Equivalent to "jmp.d32 \target" */ .byte 0xe9 --- a/arch/x86/include/asm/nops.h +++ b/arch/x86/include/asm/nops.h @@ -4,89 +4,58 @@ /* * Define nops for use with alternative() and for tracing. + */ + +#ifndef CONFIG_64BIT + +/* + * Generic 32bit nops from GAS: + * + * 1: nop + * 2: movl %esi,%esi + * 3: leal 0x0(%esi),%esi + * 4: leal 0x0(%esi,%eiz,1),%esi + * 5: leal %ds:0x0(%esi,%eiz,1),%esi + * 6: leal 0x0(%esi),%esi + * 7: leal 0x0(%esi,%eiz,1),%esi + * 8: leal %ds:0x0(%esi,%eiz,1),%esi * - * *_NOP5_ATOMIC must be a single instruction. + * Except 5 and 8, which are DS prefixed 4 and 7 resp, where GAS would emit 2 + * nop instructions. */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x89,0xf6 +#define BYTES_NOP3 0x8d,0x76,0x00 +#define BYTES_NOP4 0x8d,0x74,0x26,0x00 +#define BYTES_NOP5 0x3e,BYTES_NOP4 +#define BYTES_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 +#define BYTES_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x3e,BYTES_NOP7 + +#else -#define NOP_DS_PREFIX 0x3e +/* + * Generic 64bit nops from GAS: + * + * 1: nop + * 2: osp nop + * 3: nopl (%eax) + * 4: nopl 0x00(%eax) + * 5: nopl 0x00(%eax,%eax,1) + * 6: osp nopl 0x00(%eax,%eax,1) + * 7: nopl 0x00000000(%eax) + * 8: nopl 0x00000000(%eax,%eax,1) + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x66,BYTES_NOP1 +#define BYTES_NOP3 0x0f,0x1f,0x00 +#define BYTES_NOP4 0x0f,0x1f,0x40,0x00 +#define BYTES_NOP5 0x0f,0x1f,0x44,0x00,0x00 +#define BYTES_NOP6 0x66,BYTES_NOP5 +#define BYTES_NOP7 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00 -/* generic versions from gas - 1: nop - the following instructions are NOT nops in 64-bit mode, - for 64-bit mode use K8 or P6 nops instead - 2: movl %esi,%esi - 3: leal 0x00(%esi),%esi - 4: leal 0x00(,%esi,1),%esi - 6: leal 0x00000000(%esi),%esi - 7: leal 0x00000000(,%esi,1),%esi -*/ -#define GENERIC_NOP1 0x90 -#define GENERIC_NOP2 0x89,0xf6 -#define GENERIC_NOP3 0x8d,0x76,0x00 -#define GENERIC_NOP4 0x8d,0x74,0x26,0x00 -#define GENERIC_NOP5 GENERIC_NOP1,GENERIC_NOP4 -#define GENERIC_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 -#define GENERIC_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 -#define GENERIC_NOP8 GENERIC_NOP1,GENERIC_NOP7 -#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4 - -/* Opteron 64bit nops - 1: nop - 2: osp nop - 3: osp osp nop - 4: osp osp osp nop -*/ -#define K8_NOP1 GENERIC_NOP1 -#define K8_NOP2 0x66,K8_NOP1 -#define K8_NOP3 0x66,K8_NOP2 -#define K8_NOP4 0x66,K8_NOP3 -#define K8_NOP5 K8_NOP3,K8_NOP2 -#define K8_NOP6 K8_NOP3,K8_NOP3 -#define K8_NOP7 K8_NOP4,K8_NOP3 -#define K8_NOP8 K8_NOP4,K8_NOP4 -#define K8_NOP5_ATOMIC 0x66,K8_NOP4 - -/* K7 nops - uses eax dependencies (arbitrary choice) - 1: nop - 2: movl %eax,%eax - 3: leal (,%eax,1),%eax - 4: leal 0x00(,%eax,1),%eax - 6: leal 0x00000000(%eax),%eax - 7: leal 0x00000000(,%eax,1),%eax -*/ -#define K7_NOP1 GENERIC_NOP1 -#define K7_NOP2 0x8b,0xc0 -#define K7_NOP3 0x8d,0x04,0x20 -#define K7_NOP4 0x8d,0x44,0x20,0x00 -#define K7_NOP5 K7_NOP4,K7_NOP1 -#define K7_NOP6 0x8d,0x80,0,0,0,0 -#define K7_NOP7 0x8D,0x04,0x05,0,0,0,0 -#define K7_NOP8 K7_NOP7,K7_NOP1 -#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4 - -/* P6 nops - uses eax dependencies (Intel-recommended choice) - 1: nop - 2: osp nop - 3: nopl (%eax) - 4: nopl 0x00(%eax) - 5: nopl 0x00(%eax,%eax,1) - 6: osp nopl 0x00(%eax,%eax,1) - 7: nopl 0x00000000(%eax) - 8: nopl 0x00000000(%eax,%eax,1) - Note: All the above are assumed to be a single instruction. - There is kernel code that depends on this. -*/ -#define P6_NOP1 GENERIC_NOP1 -#define P6_NOP2 0x66,0x90 -#define P6_NOP3 0x0f,0x1f,0x00 -#define P6_NOP4 0x0f,0x1f,0x40,0 -#define P6_NOP5 0x0f,0x1f,0x44,0x00,0 -#define P6_NOP6 0x66,0x0f,0x1f,0x44,0x00,0 -#define P6_NOP7 0x0f,0x1f,0x80,0,0,0,0 -#define P6_NOP8 0x0f,0x1f,0x84,0x00,0,0,0,0 -#define P6_NOP5_ATOMIC P6_NOP5 +#endif /* CONFIG_64BIT */ #ifdef __ASSEMBLY__ #define _ASM_MK_NOP(x) .byte x @@ -94,54 +63,19 @@ #define _ASM_MK_NOP(x) ".byte " __stringify(x) "\n" #endif -#if defined(CONFIG_MK7) -#define ASM_NOP1 _ASM_MK_NOP(K7_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(K7_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(K7_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(K7_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(K7_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(K7_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(K7_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(K7_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K7_NOP5_ATOMIC) -#elif defined(CONFIG_X86_P6_NOP) -#define ASM_NOP1 _ASM_MK_NOP(P6_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(P6_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(P6_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(P6_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(P6_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(P6_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(P6_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(P6_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(P6_NOP5_ATOMIC) -#elif defined(CONFIG_X86_64) -#define ASM_NOP1 _ASM_MK_NOP(K8_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(K8_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(K8_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(K8_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(K8_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(K8_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(K8_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(K8_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K8_NOP5_ATOMIC) -#else -#define ASM_NOP1 _ASM_MK_NOP(GENERIC_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(GENERIC_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(GENERIC_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(GENERIC_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(GENERIC_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(GENERIC_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(GENERIC_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(GENERIC_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(GENERIC_NOP5_ATOMIC) -#endif +#define ASM_NOP1 _ASM_MK_NOP(BYTES_NOP1) +#define ASM_NOP2 _ASM_MK_NOP(BYTES_NOP2) +#define ASM_NOP3 _ASM_MK_NOP(BYTES_NOP3) +#define ASM_NOP4 _ASM_MK_NOP(BYTES_NOP4) +#define ASM_NOP5 _ASM_MK_NOP(BYTES_NOP5) +#define ASM_NOP6 _ASM_MK_NOP(BYTES_NOP6) +#define ASM_NOP7 _ASM_MK_NOP(BYTES_NOP7) +#define ASM_NOP8 _ASM_MK_NOP(BYTES_NOP8) #define ASM_NOP_MAX 8 -#define NOP_ATOMIC5 (ASM_NOP_MAX+1) /* Entry for the 5-byte atomic NOP */ #ifndef __ASSEMBLY__ -extern const unsigned char * const *ideal_nops; -extern void arch_init_ideal_nops(void); +extern const unsigned char * const x86_nops[]; #endif #endif /* _ASM_X86_NOPS_H */ --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -214,7 +214,7 @@ static inline void clflush(volatile void static inline void clflushopt(volatile void *__p) { - alternative_io(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0", + alternative_io(".byte 0x3e; clflush %P0", ".byte 0x66; clflush %P0", X86_FEATURE_CLFLUSHOPT, "+m" (*(volatile char __force *)__p)); @@ -225,7 +225,7 @@ static inline void clwb(volatile void *_ volatile struct { char x[64]; } *p = __p; asm volatile(ALTERNATIVE_2( - ".byte " __stringify(NOP_DS_PREFIX) "; clflush (%[pax])", + ".byte 0x3e; clflush (%[pax])", ".byte 0x66; clflush (%[pax])", /* clflushopt (%%rax) */ X86_FEATURE_CLFLUSHOPT, ".byte 0x66, 0x0f, 0xae, 0x30", /* clwb (%%rax) */ --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -74,186 +74,30 @@ do { \ } \ } while (0) -/* - * Each GENERIC_NOPX is of X bytes, and defined as an array of bytes - * that correspond to that nop. Getting from one nop to the next, we - * add to the array the offset that is equal to the sum of all sizes of - * nops preceding the one we are after. - * - * Note: The GENERIC_NOP5_ATOMIC is at the end, as it breaks the - * nice symmetry of sizes of the previous nops. - */ -#if defined(GENERIC_NOP1) && !defined(CONFIG_X86_64) -static const unsigned char intelnops[] = +const unsigned char x86nops[] = { - GENERIC_NOP1, - GENERIC_NOP2, - GENERIC_NOP3, - GENERIC_NOP4, - GENERIC_NOP5, - GENERIC_NOP6, - GENERIC_NOP7, - GENERIC_NOP8, - GENERIC_NOP5_ATOMIC -}; -static const unsigned char * const intel_nops[ASM_NOP_MAX+2] = -{ - NULL, - intelnops, - intelnops + 1, - intelnops + 1 + 2, - intelnops + 1 + 2 + 3, - intelnops + 1 + 2 + 3 + 4, - intelnops + 1 + 2 + 3 + 4 + 5, - intelnops + 1 + 2 + 3 + 4 + 5 + 6, - intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, + BYTES_NOP1, + BYTES_NOP2, + BYTES_NOP3, + BYTES_NOP4, + BYTES_NOP5, + BYTES_NOP6, + BYTES_NOP7, + BYTES_NOP8, }; -#endif -#ifdef K8_NOP1 -static const unsigned char k8nops[] = -{ - K8_NOP1, - K8_NOP2, - K8_NOP3, - K8_NOP4, - K8_NOP5, - K8_NOP6, - K8_NOP7, - K8_NOP8, - K8_NOP5_ATOMIC -}; -static const unsigned char * const k8_nops[ASM_NOP_MAX+2] = -{ - NULL, - k8nops, - k8nops + 1, - k8nops + 1 + 2, - k8nops + 1 + 2 + 3, - k8nops + 1 + 2 + 3 + 4, - k8nops + 1 + 2 + 3 + 4 + 5, - k8nops + 1 + 2 + 3 + 4 + 5 + 6, - k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, -}; -#endif - -#if defined(K7_NOP1) && !defined(CONFIG_X86_64) -static const unsigned char k7nops[] = -{ - K7_NOP1, - K7_NOP2, - K7_NOP3, - K7_NOP4, - K7_NOP5, - K7_NOP6, - K7_NOP7, - K7_NOP8, - K7_NOP5_ATOMIC -}; -static const unsigned char * const k7_nops[ASM_NOP_MAX+2] = -{ - NULL, - k7nops, - k7nops + 1, - k7nops + 1 + 2, - k7nops + 1 + 2 + 3, - k7nops + 1 + 2 + 3 + 4, - k7nops + 1 + 2 + 3 + 4 + 5, - k7nops + 1 + 2 + 3 + 4 + 5 + 6, - k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, -}; -#endif - -#ifdef P6_NOP1 -static const unsigned char p6nops[] = -{ - P6_NOP1, - P6_NOP2, - P6_NOP3, - P6_NOP4, - P6_NOP5, - P6_NOP6, - P6_NOP7, - P6_NOP8, - P6_NOP5_ATOMIC -}; -static const unsigned char * const p6_nops[ASM_NOP_MAX+2] = +const unsigned char * const x86_nops[ASM_NOP_MAX+1] = { NULL, - p6nops, - p6nops + 1, - p6nops + 1 + 2, - p6nops + 1 + 2 + 3, - p6nops + 1 + 2 + 3 + 4, - p6nops + 1 + 2 + 3 + 4 + 5, - p6nops + 1 + 2 + 3 + 4 + 5 + 6, - p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, + x86nops, + x86nops + 1, + x86nops + 1 + 2, + x86nops + 1 + 2 + 3, + x86nops + 1 + 2 + 3 + 4, + x86nops + 1 + 2 + 3 + 4 + 5, + x86nops + 1 + 2 + 3 + 4 + 5 + 6, + x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, }; -#endif - -/* Initialize these to a safe default */ -#ifdef CONFIG_X86_64 -const unsigned char * const *ideal_nops = p6_nops; -#else -const unsigned char * const *ideal_nops = intel_nops; -#endif - -void __init arch_init_ideal_nops(void) -{ - switch (boot_cpu_data.x86_vendor) { - case X86_VENDOR_INTEL: - /* - * Due to a decoder implementation quirk, some - * specific Intel CPUs actually perform better with - * the "k8_nops" than with the SDM-recommended NOPs. - */ - if (boot_cpu_data.x86 == 6 && - boot_cpu_data.x86_model >= 0x0f && - boot_cpu_data.x86_model != 0x1c && - boot_cpu_data.x86_model != 0x26 && - boot_cpu_data.x86_model != 0x27 && - boot_cpu_data.x86_model < 0x30) { - ideal_nops = k8_nops; - } else if (boot_cpu_has(X86_FEATURE_NOPL)) { - ideal_nops = p6_nops; - } else { -#ifdef CONFIG_X86_64 - ideal_nops = k8_nops; -#else - ideal_nops = intel_nops; -#endif - } - break; - - case X86_VENDOR_HYGON: - ideal_nops = p6_nops; - return; - - case X86_VENDOR_AMD: - if (boot_cpu_data.x86 > 0xf) { - ideal_nops = p6_nops; - return; - } - - fallthrough; - - default: -#ifdef CONFIG_X86_64 - ideal_nops = k8_nops; -#else - if (boot_cpu_has(X86_FEATURE_K8)) - ideal_nops = k8_nops; - else if (boot_cpu_has(X86_FEATURE_K7)) - ideal_nops = k7_nops; - else - ideal_nops = intel_nops; -#endif - } -} /* Use this to add nops to a buffer, then text_poke the whole buffer. */ static void __init_or_module add_nops(void *insns, unsigned int len) @@ -262,7 +106,7 @@ static void __init_or_module add_nops(vo unsigned int noplen = len; if (noplen > ASM_NOP_MAX) noplen = ASM_NOP_MAX; - memcpy(insns, ideal_nops[noplen], noplen); + memcpy(insns, x86_nops[noplen], noplen); insns += noplen; len -= noplen; } @@ -1302,13 +1146,13 @@ static void text_poke_loc_init(struct te default: /* assume NOP */ switch (len) { case 2: /* NOP2 -- emulate as JMP8+0 */ - BUG_ON(memcmp(emulate, ideal_nops[len], len)); + BUG_ON(memcmp(emulate, x86_nops[len], len)); tp->opcode = JMP8_INSN_OPCODE; tp->rel32 = 0; break; case 5: /* NOP5 -- emulate as JMP32+0 */ - BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], len)); + BUG_ON(memcmp(emulate, x86_nops[len], len)); tp->opcode = JMP32_INSN_OPCODE; tp->rel32 = 0; break; --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -66,7 +66,7 @@ int ftrace_arch_code_modify_post_process static const char *ftrace_nop_replace(void) { - return ideal_nops[NOP_ATOMIC5]; + return x86_nops[5]; } static const char *ftrace_call_replace(unsigned long ip, unsigned long addr) @@ -377,7 +377,7 @@ create_trampoline(struct ftrace_ops *ops ip = trampoline + (jmp_offset - start_offset); if (WARN_ON(*(char *)ip != 0x75)) goto fail; - ret = copy_from_kernel_nofault(ip, ideal_nops[2], 2); + ret = copy_from_kernel_nofault(ip, x86_nops[2], 2); if (ret < 0) goto fail; } --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -28,10 +28,8 @@ static void bug_at(const void *ip, int l } static const void * -__jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type type, int init) +__jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type type) { - const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP }; - const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5]; const void *expect, *code; const void *addr, *dest; int line; @@ -41,10 +39,8 @@ __jump_label_set_jump_code(struct jump_e code = text_gen_insn(JMP32_INSN_OPCODE, addr, dest); - if (init) { - expect = default_nop; line = __LINE__; - } else if (type == JUMP_LABEL_JMP) { - expect = ideal_nop; line = __LINE__; + if (type == JUMP_LABEL_JMP) { + expect = x86_nops[5]; line = __LINE__; } else { expect = code; line = __LINE__; } @@ -53,7 +49,7 @@ __jump_label_set_jump_code(struct jump_e bug_at(addr, line); if (type == JUMP_LABEL_NOP) - code = ideal_nop; + code = x86_nops[5]; return code; } @@ -62,7 +58,7 @@ static inline void __jump_label_transfor enum jump_label_type type, int init) { - const void *opcode = __jump_label_set_jump_code(entry, type, init); + const void *opcode = __jump_label_set_jump_code(entry, type); /* * As long as only a single processor is running and the code is still @@ -113,7 +109,7 @@ bool arch_jump_label_transform_queue(str } mutex_lock(&text_mutex); - opcode = __jump_label_set_jump_code(entry, type, 0); + opcode = __jump_label_set_jump_code(entry, type); text_poke_queue((void *)jump_entry_code(entry), opcode, JUMP_LABEL_NOP_SIZE, NULL); mutex_unlock(&text_mutex); @@ -136,22 +132,6 @@ static enum { __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type) { - /* - * This function is called at boot up and when modules are - * first loaded. Check if the default nop, the one that is - * inserted at compile time, is the ideal nop. If it is, then - * we do not need to update the nop, and we can leave it as is. - * If it is not, then we need to update the nop to the ideal nop. - */ - if (jlstate == JL_STATE_START) { - const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP }; - const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5]; - - if (memcmp(ideal_nop, default_nop, 5) != 0) - jlstate = JL_STATE_UPDATE; - else - jlstate = JL_STATE_NO_UPDATE; - } if (jlstate == JL_STATE_UPDATE) jump_label_transform(entry, type, 1); } --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -229,7 +229,7 @@ __recover_probed_insn(kprobe_opcode_t *b return 0UL; if (faddr) - memcpy(buf, ideal_nops[NOP_ATOMIC5], 5); + memcpy(buf, x86_nops[5], 5); else buf[0] = kp->opcode; return (unsigned long)buf; --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -822,7 +822,6 @@ void __init setup_arch(char **cmdline_p) idt_setup_early_traps(); early_cpu_init(); - arch_init_ideal_nops(); jump_label_init(); static_call_init(); early_ioremap_init(); --- a/arch/x86/kernel/static_call.c +++ b/arch/x86/kernel/static_call.c @@ -34,7 +34,7 @@ static void __ref __static_call_transfor break; case NOP: - code = ideal_nops[NOP_ATOMIC5]; + code = x86_nops[5]; break; case JMP: @@ -66,7 +66,7 @@ static void __static_call_validate(void return; } else { if (opcode == CALL_INSN_OPCODE || - !memcmp(insn, ideal_nops[NOP_ATOMIC5], 5) || + !memcmp(insn, x86_nops[5], 5) || !memcmp(insn, xor5rax, 5)) return; } --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -282,7 +282,7 @@ static void emit_prologue(u8 **pprog, u3 /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ - memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt); + memcpy(prog, x86_nops[5], cnt); prog += cnt; if (!ebpf_from_cbpf) { if (tail_call_reachable && !is_subprog) @@ -330,7 +330,7 @@ static int __bpf_arch_text_poke(void *ip void *old_addr, void *new_addr, const bool text_live) { - const u8 *nop_insn = ideal_nops[NOP_ATOMIC5]; + const u8 *nop_insn = x86_nops[5]; u8 old_insn[X86_PATCH_SIZE]; u8 new_insn[X86_PATCH_SIZE]; u8 *prog; @@ -560,7 +560,7 @@ static void emit_bpf_tail_call_direct(st if (stack_depth) EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); - memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE); + memcpy(prog, x86_nops[5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; /* out: */ @@ -881,7 +881,7 @@ static int emit_nops(u8 **pprog, int len noplen = ASM_NOP_MAX; for (i = 0; i < noplen; i++) - EMIT1(ideal_nops[noplen][i]); + EMIT1(x86_nops[noplen][i]); len -= noplen; } ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra @ 2021-03-12 12:09 ` Peter Zijlstra 2021-03-12 20:36 ` Linus Torvalds 2024-01-20 6:58 ` Thorsten Glaser 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2021-03-12 12:09 UTC (permalink / raw) To: x86, rostedt, hpa, torvalds Cc: linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 12:32:54PM +0100, Peter Zijlstra wrote: > +#ifndef CONFIG_64BIT > + > +/* > + * Generic 32bit nops from GAS: > + * > + * 1: nop > + * 2: movl %esi,%esi > + * 3: leal 0x0(%esi),%esi > + * 4: leal 0x0(%esi,%eiz,1),%esi > + * 5: leal %ds:0x0(%esi,%eiz,1),%esi > + * 6: leal 0x0(%esi),%esi > + * 7: leal 0x0(%esi,%eiz,1),%esi > + * 8: leal %ds:0x0(%esi,%eiz,1),%esi > * > + * Except 5 and 8, which are DS prefixed 4 and 7 resp, where GAS would emit 2 > + * nop instructions. > */ > +#define BYTES_NOP1 0x90 > +#define BYTES_NOP2 0x89,0xf6 > +#define BYTES_NOP3 0x8d,0x76,0x00 > +#define BYTES_NOP4 0x8d,0x74,0x26,0x00 > +#define BYTES_NOP5 0x3e,BYTES_NOP4 > +#define BYTES_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 > +#define BYTES_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 > +#define BYTES_NOP8 0x3e,BYTES_NOP7 > + > +#else > > +/* > + * Generic 64bit nops from GAS: > + * > + * 1: nop > + * 2: osp nop > + * 3: nopl (%eax) > + * 4: nopl 0x00(%eax) > + * 5: nopl 0x00(%eax,%eax,1) > + * 6: osp nopl 0x00(%eax,%eax,1) > + * 7: nopl 0x00000000(%eax) > + * 8: nopl 0x00000000(%eax,%eax,1) > + */ > +#define BYTES_NOP1 0x90 > +#define BYTES_NOP2 0x66,BYTES_NOP1 > +#define BYTES_NOP3 0x0f,0x1f,0x00 > +#define BYTES_NOP4 0x0f,0x1f,0x40,0x00 > +#define BYTES_NOP5 0x0f,0x1f,0x44,0x00,0x00 > +#define BYTES_NOP6 0x66,BYTES_NOP5 > +#define BYTES_NOP7 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00 > +#define BYTES_NOP8 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00 > > +#endif /* CONFIG_64BIT */ Note that this also made all NOPs single instructions and removed the special atomic nop. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2021-03-12 12:09 ` Peter Zijlstra @ 2021-03-12 20:36 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2021-03-12 20:36 UTC (permalink / raw) To: Peter Zijlstra Cc: the arch/x86 maintainers, Steven Rostedt, Peter Anvin, Linux Kernel Mailing List, linux-toolchains, Josh Poimboeuf, Alexei Starovoitov, Masami Hiramatsu On Fri, Mar 12, 2021 at 4:09 AM Peter Zijlstra <peterz@infradead.org> wrote: > > Note that this also made all NOPs single instructions and removed the > special atomic nop. Ack. Good riddance. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra 2021-03-12 12:09 ` Peter Zijlstra @ 2024-01-20 6:58 ` Thorsten Glaser 2024-01-20 8:22 ` H. Peter Anvin 1 sibling, 1 reply; 51+ messages in thread From: Thorsten Glaser @ 2024-01-20 6:58 UTC (permalink / raw) To: Peter Zijlstra Cc: x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat Peter Zijlstra dixit: >-/* generic versions from gas […] >- 3: leal 0x00(%esi),%esi >- 4: leal 0x00(,%esi,1),%esi >- 6: leal 0x00000000(%esi),%esi >- 7: leal 0x00000000(,%esi,1),%esi vs. >+ * Generic 32bit nops from GAS: […] >+ * 3: leal 0x0(%esi),%esi >+ * 4: leal 0x0(%esi,%eiz,1),%esi >+ * 5: leal %ds:0x0(%esi,%eiz,1),%esi >+ * 6: leal 0x0(%esi),%esi >+ * 7: leal 0x0(%esi,%eiz,1),%esi >+ * 8: leal %ds:0x0(%esi,%eiz,1),%esi I think there’s some mistake introduced. The BYTES_* are identical for e.g. #7, but %eiz must be wrong, it’s not a register. Indeed, gas (on Debian bullseye) does not assemble that either. (Awful AT&T syntax aside…) bye, //mirabilos -- „Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund, mksh auf jedem System zu installieren.“ -- XTaran auf der OpenRheinRuhr, ganz begeistert (EN: “[…]uhr.gz is a reason to install mksh on every system.”) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 6:58 ` Thorsten Glaser @ 2024-01-20 8:22 ` H. Peter Anvin 2024-01-20 16:53 ` Thorsten Glaser 2024-01-20 17:00 ` Linus Torvalds 0 siblings, 2 replies; 51+ messages in thread From: H. Peter Anvin @ 2024-01-20 8:22 UTC (permalink / raw) To: Thorsten Glaser, Peter Zijlstra Cc: x86, rostedt, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On January 19, 2024 10:58:56 PM PST, Thorsten Glaser <tg@debian.org> wrote: >Peter Zijlstra dixit: > >>-/* generic versions from gas >[…] >>- 3: leal 0x00(%esi),%esi >>- 4: leal 0x00(,%esi,1),%esi >>- 6: leal 0x00000000(%esi),%esi >>- 7: leal 0x00000000(,%esi,1),%esi > >vs. > >>+ * Generic 32bit nops from GAS: >[…] >>+ * 3: leal 0x0(%esi),%esi >>+ * 4: leal 0x0(%esi,%eiz,1),%esi >>+ * 5: leal %ds:0x0(%esi,%eiz,1),%esi >>+ * 6: leal 0x0(%esi),%esi >>+ * 7: leal 0x0(%esi,%eiz,1),%esi >>+ * 8: leal %ds:0x0(%esi,%eiz,1),%esi > >I think there’s some mistake introduced. The BYTES_* are >identical for e.g. #7, but %eiz must be wrong, it’s not >a register. Indeed, gas (on Debian bullseye) does not >assemble that either. > >(Awful AT&T syntax aside…) > >bye, >//mirabilos %eiz was something that binutils used to put in when disassembling certain redundant encodings with SIB at some point. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 8:22 ` H. Peter Anvin @ 2024-01-20 16:53 ` Thorsten Glaser 2024-01-21 23:21 ` H. Peter Anvin 2024-01-20 17:00 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Thorsten Glaser @ 2024-01-20 16:53 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, x86, rostedt, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat H. Peter Anvin dixit: >%eiz was something that binutils used to put in when disassembling >certain redundant encodings with SIB at some point. Ah, fair enough. Maybe this could be added as one more line in the comment or something. Thanks, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 16:53 ` Thorsten Glaser @ 2024-01-21 23:21 ` H. Peter Anvin 2024-01-21 23:58 ` Thorsten Glaser 0 siblings, 1 reply; 51+ messages in thread From: H. Peter Anvin @ 2024-01-21 23:21 UTC (permalink / raw) To: Thorsten Glaser Cc: Peter Zijlstra, x86, rostedt, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On 1/20/24 08:53, Thorsten Glaser wrote: > H. Peter Anvin dixit: > >> %eiz was something that binutils used to put in when disassembling >> certain redundant encodings with SIB at some point. > > Ah, fair enough. Maybe this could be added as one more line in > the comment or something. > I think "proper" gas syntax would be 0(%esi,,1), although that doesn't assemble either (I don't believe there is a way to get gas to actually generate this sequence.) But yes, with all even remotely recent CPUs all actually handling nopl properly, this isn't relevant anymore. -hpa ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-21 23:21 ` H. Peter Anvin @ 2024-01-21 23:58 ` Thorsten Glaser 2024-01-22 0:15 ` H. Peter Anvin 0 siblings, 1 reply; 51+ messages in thread From: Thorsten Glaser @ 2024-01-21 23:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, x86, rostedt, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat H. Peter Anvin dixit: > But yes, with all even remotely recent CPUs all actually handling nopl > properly, this isn't relevant anymore. This was, incidentally, triggered by looking into a problem report that something did *not* work on a Geode LX system. People don’t just run Linux on “recent CPUs” (though I at least got me an Atom and a Core2Duo for it and run BSD on my Pentium-M and VIA C7 systems). bye, //mirabilos -- „Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund, mksh auf jedem System zu installieren.“ -- XTaran auf der OpenRheinRuhr, ganz begeistert (EN: “[…]uhr.gz is a reason to install mksh on every system.”) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-21 23:58 ` Thorsten Glaser @ 2024-01-22 0:15 ` H. Peter Anvin 2024-01-22 0:56 ` Steven Rostedt 0 siblings, 1 reply; 51+ messages in thread From: H. Peter Anvin @ 2024-01-22 0:15 UTC (permalink / raw) To: Thorsten Glaser Cc: Peter Zijlstra, x86, rostedt, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On January 21, 2024 3:58:11 PM PST, Thorsten Glaser <tg@debian.org> wrote: >H. Peter Anvin dixit: > >> But yes, with all even remotely recent CPUs all actually handling nopl >> properly, this isn't relevant anymore. > >This was, incidentally, triggered by looking into a problem report that >something did *not* work on a Geode LX system. > >People don’t just run Linux on “recent CPUs” (though I at least got me >an Atom and a Core2Duo for it and run BSD on my Pentium-M and VIA C7 >systems). > >bye, >//mirabilos Yes, but it is a matter of where we optimize for performance as opposed to correctness. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 0:15 ` H. Peter Anvin @ 2024-01-22 0:56 ` Steven Rostedt 2024-01-22 1:17 ` Thorsten Glaser 2024-01-22 2:15 ` H. Peter Anvin 0 siblings, 2 replies; 51+ messages in thread From: Steven Rostedt @ 2024-01-22 0:56 UTC (permalink / raw) To: H. Peter Anvin Cc: Thorsten Glaser, Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sun, 21 Jan 2024 16:15:57 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > On January 21, 2024 3:58:11 PM PST, Thorsten Glaser <tg@debian.org> wrote: > >H. Peter Anvin dixit: > > > >> But yes, with all even remotely recent CPUs all actually handling nopl > >> properly, this isn't relevant anymore. > > > >This was, incidentally, triggered by looking into a problem report that > >something did *not* work on a Geode LX system. What problem happened? > > > >People don’t just run Linux on “recent CPUs” (though I at least got me > >an Atom and a Core2Duo for it and run BSD on my Pentium-M and VIA C7 > >systems). > > > >bye, > >//mirabilos > > Yes, but it is a matter of where we optimize for performance as opposed to correctness. There is no such thing as "optimize for correctness", it is either correct or it is not. Correctness should always come before performance (at least that is what Thomas has pounded into me ;-) If a kernel use to work on a machine but a newer version no longer works, I call that a regression. -- Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 0:56 ` Steven Rostedt @ 2024-01-22 1:17 ` Thorsten Glaser 2024-01-22 2:04 ` H. Peter Anvin 2024-01-22 2:15 ` H. Peter Anvin 1 sibling, 1 reply; 51+ messages in thread From: Thorsten Glaser @ 2024-01-22 1:17 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat Steven Rostedt dixit: >> >This was, incidentally, triggered by looking into a problem report that >> >something did *not* work on a Geode LX system. > >What problem happened? It turned out to be a compiler issue (GCC thinks i686 means PPro, not 686-class CPUs, and -fcf-protection causes long NOPs, which not all 686-class CPUs support, to be inserted). This turned out to break a large part of Debian stable on OLPCs and other systems, and the kernel’s changes in nopl handling were tabled as arguments. https://www.jookia.org/wiki/Nopl has a longer writeup on the nopl history. bye, //mirabilos -- <igli> exceptions: a truly awful implementation of quite a nice idea. <igli> just about the worst way you could do something like that, afaic. <igli> it's like anti-design. <mirabilos> that too… may I quote you on that? <igli> sure, tho i doubt anyone will listen ;) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 1:17 ` Thorsten Glaser @ 2024-01-22 2:04 ` H. Peter Anvin 0 siblings, 0 replies; 51+ messages in thread From: H. Peter Anvin @ 2024-01-22 2:04 UTC (permalink / raw) To: Thorsten Glaser, Steven Rostedt Cc: Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On January 21, 2024 5:17:36 PM PST, Thorsten Glaser <tg@debian.org> wrote: >Steven Rostedt dixit: > >>> >This was, incidentally, triggered by looking into a problem report that >>> >something did *not* work on a Geode LX system. >> >>What problem happened? > >It turned out to be a compiler issue (GCC thinks i686 means PPro, >not 686-class CPUs, and -fcf-protection causes long NOPs, which >not all 686-class CPUs support, to be inserted). This turned out >to break a large part of Debian stable on OLPCs and other systems, >and the kernel’s changes in nopl handling were tabled as arguments. > >https://www.jookia.org/wiki/Nopl has a longer writeup on the nopl >history. > >bye, >//mirabilos i686 *is* Pentium Pro... ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 0:56 ` Steven Rostedt 2024-01-22 1:17 ` Thorsten Glaser @ 2024-01-22 2:15 ` H. Peter Anvin 2024-01-22 2:22 ` Steven Rostedt 1 sibling, 1 reply; 51+ messages in thread From: H. Peter Anvin @ 2024-01-22 2:15 UTC (permalink / raw) To: Steven Rostedt Cc: Thorsten Glaser, Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On 1/21/24 16:56, Steven Rostedt wrote: >> >> Yes, but it is a matter of where we optimize for performance as opposed to correctness. > > There is no such thing as "optimize for correctness", it is either > correct or it is not. Correctness should always come before performance > (at least that is what Thomas has pounded into me ;-) > > If a kernel use to work on a machine but a newer version no longer > works, I call that a regression. > There absolutely is such a thing as "optimize for correctness." It means to keep the code clean, easily testable, and with a minimal number of distinct code paths so that regressions and *especially* uncaught regressions get caught quickly. -hpa ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 2:15 ` H. Peter Anvin @ 2024-01-22 2:22 ` Steven Rostedt 2024-01-22 2:31 ` H. Peter Anvin 0 siblings, 1 reply; 51+ messages in thread From: Steven Rostedt @ 2024-01-22 2:22 UTC (permalink / raw) To: H. Peter Anvin Cc: Thorsten Glaser, Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sun, 21 Jan 2024 18:15:39 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 1/21/24 16:56, Steven Rostedt wrote: > >> > >> Yes, but it is a matter of where we optimize for performance as opposed to correctness. > > > > There is no such thing as "optimize for correctness", it is either > > correct or it is not. Correctness should always come before performance > > (at least that is what Thomas has pounded into me ;-) > > > > If a kernel use to work on a machine but a newer version no longer > > works, I call that a regression. > > > > There absolutely is such a thing as "optimize for correctness." It means > to keep the code clean, easily testable, and with a minimal number of > distinct code paths so that regressions and *especially* uncaught > regressions get caught quickly. I call that maintainability, not correctness. It is either correct and works, or is incorrect and does not work. You can change code to be more maintainable and still make it incorrect. -- Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-22 2:22 ` Steven Rostedt @ 2024-01-22 2:31 ` H. Peter Anvin 0 siblings, 0 replies; 51+ messages in thread From: H. Peter Anvin @ 2024-01-22 2:31 UTC (permalink / raw) To: Steven Rostedt Cc: Thorsten Glaser, Peter Zijlstra, x86, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On January 21, 2024 6:22:36 PM PST, Steven Rostedt <rostedt@goodmis.org> wrote: >On Sun, 21 Jan 2024 18:15:39 -0800 >"H. Peter Anvin" <hpa@zytor.com> wrote: > >> On 1/21/24 16:56, Steven Rostedt wrote: >> >> >> >> Yes, but it is a matter of where we optimize for performance as opposed to correctness. >> > >> > There is no such thing as "optimize for correctness", it is either >> > correct or it is not. Correctness should always come before performance >> > (at least that is what Thomas has pounded into me ;-) >> > >> > If a kernel use to work on a machine but a newer version no longer >> > works, I call that a regression. >> > >> >> There absolutely is such a thing as "optimize for correctness." It means >> to keep the code clean, easily testable, and with a minimal number of >> distinct code paths so that regressions and *especially* uncaught >> regressions get caught quickly. > >I call that maintainability, not correctness. It is either correct and >works, or is incorrect and does not work. > >You can change code to be more maintainable and still make it incorrect. > >-- Steve Yes, of course. That's called failure :) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 8:22 ` H. Peter Anvin 2024-01-20 16:53 ` Thorsten Glaser @ 2024-01-20 17:00 ` Linus Torvalds 2024-01-20 17:19 ` Thorsten Glaser 2024-01-21 22:36 ` [PATCH 1/2] x86: Remove dynamic NOP selection David Laight 1 sibling, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2024-01-20 17:00 UTC (permalink / raw) To: H. Peter Anvin Cc: Thorsten Glaser, Peter Zijlstra, x86, rostedt, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, 20 Jan 2024 at 00:28, H. Peter Anvin <hpa@zytor.com> wrote: > > %eiz was something that binutils used to put in when disassembling certain redundant encodings with SIB at some point. Yeah, it's purely (bad) syntactic sugar for "no register". Somebody decided that the fact that so many RISC architectures have a "zero register" means that they should make x86 look like it has a "zero register" too. I assume it regularized some very silly decoding issue, but it was horrible. It's not the worst thing I've ever seen - in objdump output, and it's easy to just remove with a sed script or a simple search-and-replace in the editor. Unlike some of the other "design" choices of objdump. On that note, does anybody have a better disassembler than objdump? Or even a script around it to make it more useful? I do use "objdump --disassemble" a fair amount, and I hate how bad it is. My pet peeve is the crazy relocation handling (or lack there-of). IOW, if I do something like objdump --disassemble \ --no-show-raw-insn --no-addresses \ kernel/exit.o I get output like this: call <delayed_put_task_struct+0x1a> whis is garbage: it's not calling delayed_put_task_struct+0x1a at all, that's just "the offset bytes are all zero because the data is in the relocation". And if I add "-r" to get relocation info, I get call <delayed_put_task_struct+0x1a> R_X86_64_PLT32 rethook_flush_task-0x4 which shows the raw relocation data, but with truly mind-bogglingly horrendous syntax. Is there some sane tool that just does the sane thing and shows this as call rethook_flush_task which is what the thing actually means? And no, the llvm-objdump thing isn't any better. It isn't compatible with the GNU binutils objdump, but it does the same insanely bad decoding. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 17:00 ` Linus Torvalds @ 2024-01-20 17:19 ` Thorsten Glaser 2024-01-20 18:21 ` disassemblers (was Re: [PATCH 1/2] x86: Remove dynamic NOP selection) Thorsten Glaser 2024-01-21 22:36 ` [PATCH 1/2] x86: Remove dynamic NOP selection David Laight 1 sibling, 1 reply; 51+ messages in thread From: Thorsten Glaser @ 2024-01-20 17:19 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Peter Zijlstra, x86, rostedt, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat Linus Torvalds dixit: >On that note, does anybody have a better disassembler than objdump? Or >even a script around it to make it more useful? I do use "objdump >--disassemble" a fair amount, and I hate how bad it is. Other than -Mintel (and -Mintel,i8086 for boot code) to make the syntax 90–95% less awful, I’m sorry I’ve also been looking. >My pet peeve is the crazy relocation handling (or lack there-of). IOW, Yes! This! I’ve been putting markers into the file and then disassembling the final linked thing instead of just the object file I need because of this. >Is there some sane tool that just does the sane thing and shows this as The only other disassemblers I know don’t know about ELF objects at all, I’m sorry to say. I didn’t know about objdump -r, but that’s truly awful to read. Given a wide enough screen, an intermediate | sed 's/^\t/&&&&/' at least moves the relocation info more to the right to interrupt the reading flow less. Thanks, //mirabilos -- FWIW, I'm quite impressed with mksh interactively. I thought it was much *much* more bare bones. But it turns out it beats the living hell out of ksh93 in that respect. I'd even consider it for my daily use if I hadn't wasted half my life on my zsh setup. :-) -- Frank Terbeck in #!/bin/mksh ^ permalink raw reply [flat|nested] 51+ messages in thread
* disassemblers (was Re: [PATCH 1/2] x86: Remove dynamic NOP selection) 2024-01-20 17:19 ` Thorsten Glaser @ 2024-01-20 18:21 ` Thorsten Glaser 0 siblings, 0 replies; 51+ messages in thread From: Thorsten Glaser @ 2024-01-20 18:21 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Peter Zijlstra, x86, rostedt, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat Dixi quod… >>Is there some sane tool that just does the sane thing and shows this as > >The only other disassemblers I know don’t know about ELF objects >at all, I’m sorry to say. I have searched through my bookmarks and found “Agner Fog’s objconv” https://www.agner.org/optimize/#objconv which I had not yet tried as it comes with a .exe but apparently, the included GPL source builds on GNU/Linux (and BSD and MacOSX) as well. Usage is: ./objconv -fgasm filename.o This will write filename.asm ⚠ into the same directory as the .o file, surprisingly. It works for i386 and amd64 but not x32 (aka amd64ilp32) which is mis-disassembled as if it were i386. Sample output fragment: tsv_header: sub rsp, 8 # 00E3 _ 48: 83. EC, 08 lea rdi, [.LC7+rip] # 00E7 _ 48: 8D. 3D, 00000000(rel) call puts@PLT # 00EE _ E8, 00000000(PLT r) add rsp, 8 # 00F3 _ 48: 83. C4, 08 ret # 00F7 _ C3 Bit irritating is it uses decimal numbers… sub rsp, 232 # 0102 _ 48: 81. EC, 000000E8 … and the way the input is separated with colon, period and comma, but it’s legible enough. Credits to Peter Cordes for the discovery. bye, //mirabilos -- When he found out that the m68k port was in a pretty bad shape, he did not, like many before him, shrug and move on; instead, he took it upon himself to start compiling things, just so he could compile his shell. How's that for dedication. -- Wouter, about my Debian/m68k revival ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-20 17:00 ` Linus Torvalds 2024-01-20 17:19 ` Thorsten Glaser @ 2024-01-21 22:36 ` David Laight 2024-01-21 23:10 ` H. Peter Anvin 1 sibling, 1 reply; 51+ messages in thread From: David Laight @ 2024-01-21 22:36 UTC (permalink / raw) To: 'Linus Torvalds', H. Peter Anvin Cc: Thorsten Glaser, Peter Zijlstra, x86, rostedt, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat From: Linus Torvalds > Sent: 20 January 2024 17:00 ... > And if I add "-r" to get relocation info, I get > > call <delayed_put_task_struct+0x1a> > R_X86_64_PLT32 rethook_flush_task-0x4 > > which shows the raw relocation data, but with truly mind-bogglingly > horrendous syntax. > > Is there some sane tool that just does the sane thing and shows this as > > call rethook_flush_task > > which is what the thing actually means? While you are re-writing a disassembler, remember to print the contents of string when you get a reference into .rodata.str :-) How many times have you had to dig out a printf format string in order to locate the source associated with some object code? It is so much easier if the disassembler does it for you. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH 1/2] x86: Remove dynamic NOP selection 2024-01-21 22:36 ` [PATCH 1/2] x86: Remove dynamic NOP selection David Laight @ 2024-01-21 23:10 ` H. Peter Anvin 0 siblings, 0 replies; 51+ messages in thread From: H. Peter Anvin @ 2024-01-21 23:10 UTC (permalink / raw) To: David Laight, 'Linus Torvalds' Cc: Thorsten Glaser, Peter Zijlstra, x86, rostedt, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On January 21, 2024 2:36:32 PM PST, David Laight <David.Laight@ACULAB.COM> wrote: >From: Linus Torvalds >> Sent: 20 January 2024 17:00 >... >> And if I add "-r" to get relocation info, I get >> >> call <delayed_put_task_struct+0x1a> >> R_X86_64_PLT32 rethook_flush_task-0x4 >> >> which shows the raw relocation data, but with truly mind-bogglingly >> horrendous syntax. >> >> Is there some sane tool that just does the sane thing and shows this as >> >> call rethook_flush_task >> >> which is what the thing actually means? > >While you are re-writing a disassembler, remember to print the >contents of string when you get a reference into .rodata.str :-) > >How many times have you had to dig out a printf format string in >order to locate the source associated with some object code? >It is so much easier if the disassembler does it for you. > > David > >- >Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK >Registration No: 1397386 (Wales) > Probably don't even need to rewrite the disassembler. Postprocessing is probably sufficient. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 2/2] objtool,x86: Use asm/nops.h 2021-03-12 11:32 [PATCH 0/2] x86: Remove ideal_nops[] Peter Zijlstra 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra @ 2021-03-12 11:32 ` Peter Zijlstra 2021-03-12 14:29 ` [PATCH 0/2] x86: Remove ideal_nops[] Sedat Dilek 2021-03-12 20:59 ` Borislav Petkov 3 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2021-03-12 11:32 UTC (permalink / raw) To: x86, rostedt, hpa, torvalds Cc: linux-kernel, linux-toolchains, peterz, jpoimboe, alexei.starovoitov, mhiramat Since the kernel will rely on a single canonical set of NOPs, make sure objtool uses the exact same ones. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- tools/arch/x86/include/asm/nops.h | 81 ++++++++++++++++++++++++++++++++++++++ tools/objtool/arch/x86/decode.c | 13 +++--- tools/objtool/sync-check.sh | 1 3 files changed, 90 insertions(+), 5 deletions(-) --- /dev/null +++ b/tools/arch/x86/include/asm/nops.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_NOPS_H +#define _ASM_X86_NOPS_H + +/* + * Define nops for use with alternative() and for tracing. + */ + +#ifndef CONFIG_64BIT + +/* + * Generic 32bit nops from GAS: + * + * 1: nop + * 2: movl %esi,%esi + * 3: leal 0x0(%esi),%esi + * 4: leal 0x0(%esi,%eiz,1),%esi + * 5: leal %ds:0x0(%esi,%eiz,1),%esi + * 6: leal 0x0(%esi),%esi + * 7: leal 0x0(%esi,%eiz,1),%esi + * 8: leal %ds:0x0(%esi,%eiz,1),%esi + * + * Except 5 and 8, which are DS prefixed 4 and 7 resp, where GAS would emit 2 + * nop instructions. + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x89,0xf6 +#define BYTES_NOP3 0x8d,0x76,0x00 +#define BYTES_NOP4 0x8d,0x74,0x26,0x00 +#define BYTES_NOP5 0x3e,BYTES_NOP4 +#define BYTES_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 +#define BYTES_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x3e,BYTES_NOP7 + +#else + +/* + * Generic 64bit nops from GAS: + * + * 1: nop + * 2: osp nop + * 3: nopl (%eax) + * 4: nopl 0x00(%eax) + * 5: nopl 0x00(%eax,%eax,1) + * 6: osp nopl 0x00(%eax,%eax,1) + * 7: nopl 0x00000000(%eax) + * 8: nopl 0x00000000(%eax,%eax,1) + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x66,BYTES_NOP1 +#define BYTES_NOP3 0x0f,0x1f,0x00 +#define BYTES_NOP4 0x0f,0x1f,0x40,0x00 +#define BYTES_NOP5 0x0f,0x1f,0x44,0x00,0x00 +#define BYTES_NOP6 0x66,BYTES_NOP5 +#define BYTES_NOP7 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00 + +#endif /* CONFIG_64BIT */ + +#ifdef __ASSEMBLY__ +#define _ASM_MK_NOP(x) .byte x +#else +#define _ASM_MK_NOP(x) ".byte " __stringify(x) "\n" +#endif + +#define ASM_NOP1 _ASM_MK_NOP(BYTES_NOP1) +#define ASM_NOP2 _ASM_MK_NOP(BYTES_NOP2) +#define ASM_NOP3 _ASM_MK_NOP(BYTES_NOP3) +#define ASM_NOP4 _ASM_MK_NOP(BYTES_NOP4) +#define ASM_NOP5 _ASM_MK_NOP(BYTES_NOP5) +#define ASM_NOP6 _ASM_MK_NOP(BYTES_NOP6) +#define ASM_NOP7 _ASM_MK_NOP(BYTES_NOP7) +#define ASM_NOP8 _ASM_MK_NOP(BYTES_NOP8) + +#define ASM_NOP_MAX 8 + +#ifndef __ASSEMBLY__ +extern const unsigned char * const x86_nops[]; +#endif + +#endif /* _ASM_X86_NOPS_H */ --- a/tools/objtool/arch/x86/decode.c +++ b/tools/objtool/arch/x86/decode.c @@ -11,6 +11,9 @@ #include "../../../arch/x86/lib/inat.c" #include "../../../arch/x86/lib/insn.c" +#define CONFIG_64BIT 1 +#include <asm/nops.h> + #include <asm/orc_types.h> #include <objtool/check.h> #include <objtool/elf.h> @@ -640,11 +643,11 @@ void arch_initial_func_cfi_state(struct const char *arch_nop_insn(int len) { static const char nops[5][5] = { - /* 1 */ { 0x90 }, - /* 2 */ { 0x66, 0x90 }, - /* 3 */ { 0x0f, 0x1f, 0x00 }, - /* 4 */ { 0x0f, 0x1f, 0x40, 0x00 }, - /* 5 */ { 0x0f, 0x1f, 0x44, 0x00, 0x00 }, + { BYTES_NOP1 }, + { BYTES_NOP2 }, + { BYTES_NOP3 }, + { BYTES_NOP4 }, + { BYTES_NOP5 }, }; if (len < 1 || len > 5) { --- a/tools/objtool/sync-check.sh +++ b/tools/objtool/sync-check.sh @@ -10,6 +10,7 @@ FILES="include/linux/objtool.h" if [ "$SRCARCH" = "x86" ]; then FILES="$FILES +arch/x86/include/asm/nops.h arch/x86/include/asm/inat_types.h arch/x86/include/asm/orc_types.h arch/x86/include/asm/emulate_prefix.h ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 11:32 [PATCH 0/2] x86: Remove ideal_nops[] Peter Zijlstra 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra 2021-03-12 11:32 ` [PATCH 2/2] objtool,x86: Use asm/nops.h Peter Zijlstra @ 2021-03-12 14:29 ` Sedat Dilek 2021-03-12 14:47 ` Borislav Petkov 2021-03-12 20:59 ` Borislav Petkov 3 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-12 14:29 UTC (permalink / raw) To: Peter Zijlstra Cc: x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 1:00 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Hi! > > A while ago Steve complained about x86 being weird for having different NOPs [1] > > Having cursed the same thing before, I figured it was time to look at the NOP > situation. > > 32bit simply isn't a performance target anymore, so all we need is a set of > NOPs that works on all. > > x86_64 has two main NOP variants, NOPL and prefix NOP. NOPL was introduced by > P6 and is architecturally mandated for x86_64. However, some uarchs made the > choice to limit NOPL decoding to a single port, which obviously limits NOPL > throughput. Other uarchs have (severe) decoding penalties for excessive (>~3) > prefixes, hobbling prefix NOP throughput. > > But the thing is, all the modern uarchs can handle both without issue; that is > AMD K10 (2007) and later and Intel Ivy Bridge (2012) and later. The only > exception is Atom, which has the prefix penalty. > > Since ultimate performance of a 10 year old chip (Intel Sandy Bridge, 2011) is > simply irrelevant today, remove variable NOPs and use NOPL. > Hi Peter, I am an Intel SandyBridge power user and want the ultimate performance on my hardware. What does this change exactly mean to/for me? I got this laptop as the last gift for my birthday in 2012 from my mother. She died the same year. So, this is a bit sentimental hardware for me. It's amazing what this laptop all was involved in. 10+ years of LLVM/Clang for Linux-kernel and Linux graphics stack. Worked in a Ubuntu/precise 12.04 LTS WUBI (installation) environment - 5 years (full LTS period) long! How many Linux-kernel bugs got reported and/or fixed... Debian/stretch...Debian/bullseye with no fresh installation. Rolling release. I remember my decision in March 2012 not to choose that Asus notebook with the first hardware-revision of IvyBridge and bought conservatively a SandyBridge Gen. 2 Samsung notebook. It's a pity to see no or restricted/limited Vulkan support. If you are not concerned - life goes on for you. It's like being white colored not understanding what "Black Lives Matter" really means. If people use or talk about white/black listings then allow/deny lists. Or being a female software developer having a 10-15% less salary because you are not male - in the same department! This week we had our 100th anniversary of International Women's Day. I am not black - I am male - I am not concerned - Live goes on? Again, this machine is able to do fast Linux-kernel builds with an adapted Debian Linux v5.10 kernel-config. If you do NOT use Debian's LLVM/Clang - means build a selfmade stage1-only LLVM toolchain (saves ~1 hour of build-time) - or a ThinLTO+PGO optimized LLVM toolchain (saves again ~1 hour of build-time). Latest Linus Git plus With Clang-CFI took me today approx. 04:20 [hh:mm] with a selfmade stage1-only LLVM toolchain version 12.0.0-rc3. Again, this is amazing. What I wanna try to say is: This is old hardware but you can - if you are a smart enough - optimize your builds. On the other hand I can understand dropping support for XXX whatever hardware... Where is the limit(ation): Support 10 years or 7 years old hardware? Sorry, I am a bit concerned that this is the beginning - or a backdoor ? - to drop (optimized) Intel SandyBridge support. So, what do I need to do - to have "ultimate performance" back for SandyBridge with your patchset :-)? Yes, you are right: Life goes on. Regards, - Sedat - > This gives us deterministic NOPs and restores sanity. > > > > [1] https://lkml.kernel.org/r/20210302105827.3403656c@gandalf.local.home > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 14:29 ` [PATCH 0/2] x86: Remove ideal_nops[] Sedat Dilek @ 2021-03-12 14:47 ` Borislav Petkov 2021-03-12 17:26 ` Steven Rostedt 0 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-12 14:47 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 03:29:48PM +0100, Sedat Dilek wrote: > What does this change exactly mean to/for me? Probably nothing. I would be very surprised if it would be at all noticeable for you - it's not like the kernel is executing long streams of NOPs in fast paths. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 14:47 ` Borislav Petkov @ 2021-03-12 17:26 ` Steven Rostedt 2021-03-12 17:35 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Steven Rostedt @ 2021-03-12 17:26 UTC (permalink / raw) To: Borislav Petkov Cc: Sedat Dilek, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, 12 Mar 2021 15:47:26 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Fri, Mar 12, 2021 at 03:29:48PM +0100, Sedat Dilek wrote: > > What does this change exactly mean to/for me? > > Probably nothing. > > I would be very surprised if it would be at all noticeable for you - > it's not like the kernel is executing long streams of NOPs in fast > paths. > With ftrace enabled, every function starts with a NOP. But that said, the simple answer is for Sedat to apply the patches on his box and do some performance testing. It doesn't matter if you are white, black, male, female, or anything in between. As my daughter's swim coach said; it's the numbers that matter here. Run a bunch of benchmarks on your box on the latest kernel, apply Peter's patches, and then run the benchmarks again on the latest kernel with Peter's patches and then report the difference. If it's negligible then there's nothing to worry about. -- Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 17:26 ` Steven Rostedt @ 2021-03-12 17:35 ` Sedat Dilek 2021-03-12 17:46 ` Borislav Petkov 2021-03-12 17:47 ` Steven Rostedt 0 siblings, 2 replies; 51+ messages in thread From: Sedat Dilek @ 2021-03-12 17:35 UTC (permalink / raw) To: Steven Rostedt Cc: Borislav Petkov, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 6:26 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 12 Mar 2021 15:47:26 +0100 > Borislav Petkov <bp@alien8.de> wrote: > > > On Fri, Mar 12, 2021 at 03:29:48PM +0100, Sedat Dilek wrote: > > > What does this change exactly mean to/for me? > > > > Probably nothing. > > > > I would be very surprised if it would be at all noticeable for you - > > it's not like the kernel is executing long streams of NOPs in fast > > paths. > > > > With ftrace enabled, every function starts with a NOP. But that said, the > simple answer is for Sedat to apply the patches on his box and do some > performance testing. It doesn't matter if you are white, black, male, > female, or anything in between. As my daughter's swim coach said; it's the > numbers that matter here. Run a bunch of benchmarks on your box on the > latest kernel, apply Peter's patches, and then run the benchmarks again on > the latest kernel with Peter's patches and then report the difference. If > it's negligible then there's nothing to worry about. > Hey Steve, you degraded me to a number :-). I dunno which Git tree this patchset applies to, but I check if I can apply the patchset to my current local Git. Then build a kernel in the same build-environment. Lemme see. To say with Linus's words: "Numbers talk - bullshit walks." - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 17:35 ` Sedat Dilek @ 2021-03-12 17:46 ` Borislav Petkov 2021-03-12 17:47 ` Steven Rostedt 1 sibling, 0 replies; 51+ messages in thread From: Borislav Petkov @ 2021-03-12 17:46 UTC (permalink / raw) To: Sedat Dilek Cc: Steven Rostedt, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 06:35:45PM +0100, Sedat Dilek wrote: > Hey Steve, you degraded me to a number :-). How did he degrade you to a number?! Actually, he went the length to patiently explain what you could do. > I dunno which Git tree this patchset applies to, but I check if I can https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master branch. > apply the patchset to my current local Git. > Then build a kernel in the same build-environment. Yes, what Steve said. You can run some benchmarks and compare before/after numbers. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 17:35 ` Sedat Dilek 2021-03-12 17:46 ` Borislav Petkov @ 2021-03-12 17:47 ` Steven Rostedt 2021-03-12 18:13 ` Sedat Dilek 1 sibling, 1 reply; 51+ messages in thread From: Steven Rostedt @ 2021-03-12 17:47 UTC (permalink / raw) To: Sedat Dilek Cc: Borislav Petkov, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, 12 Mar 2021 18:35:45 +0100 Sedat Dilek <sedat.dilek@gmail.com> wrote: > Hey Steve, you degraded me to a number :-). It's the internet, everyone is a number. > > I dunno which Git tree this patchset applies to, but I check if I can > apply the patchset to my current local Git. Try Linus's latest. > Then build a kernel in the same build-environment. > Lemme see. > > To say with Linus's words: > "Numbers talk - bullshit walks." Exactly. -- Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 17:47 ` Steven Rostedt @ 2021-03-12 18:13 ` Sedat Dilek 2021-03-12 19:03 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-12 18:13 UTC (permalink / raw) To: Steven Rostedt Cc: Borislav Petkov, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat [-- Attachment #1: Type: text/plain, Size: 730 bytes --] On Fri, Mar 12, 2021 at 6:47 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 12 Mar 2021 18:35:45 +0100 > Sedat Dilek <sedat.dilek@gmail.com> wrote: > > > > Hey Steve, you degraded me to a number :-). > > It's the internet, everyone is a number. > > > > > I dunno which Git tree this patchset applies to, but I check if I can > > apply the patchset to my current local Git. > > Try Linus's latest. > $ git describe origin/HEAD v5.12-rc2-338-gf78d76e72a46 I adapted 1/2 in arch/x86/include/asm/jump_label.h to fit ^^^, see attachment. - Sedat - > > Then build a kernel in the same build-environment. > > Lemme see. > > > > To say with Linus's words: > > "Numbers talk - bullshit walks." > > Exactly. > > -- Steve [-- Attachment #2: 20210312_peterz_x86_remove_ideal_nops-dileks-v2.mbx --] [-- Type: application/octet-stream, Size: 26033 bytes --] From MAILER-DAEMON Fri Mar 12 17:46:39 2021 Message-ID: <20210312115749.065275711@infradead.org> Date: Fri, 12 Mar 2021 12:32:54 +0100 From: Peter Zijlstra <peterz@infradead.org> To: x86@kernel.org, rostedt@goodmis.org, hpa@zytor.com, torvalds@linuxfoundation.org Cc: linux-kernel@vger.kernel.org, linux-toolchains@vger.kernel.org, peterz@infradead.org, jpoimboe@redhat.com, alexei.starovoitov@gmail.com, mhiramat@kernel.org Subject: [PATCH 1/2] x86: Remove dynamic NOP selection References: <20210312113253.305040674@infradead.org> List-ID: <linux-toolchains.vger.kernel.org> X-Mailing-List: linux-toolchains@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit This ensures that a NOP is a NOP and not a random other instruction that is also a NOP. It allows simplification of dynamic code patching that wants to verify existing code before writing new instructions (ftrace, jump_label, static_call, etc..). Differentiating on NOPs is not a feature. This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS). 32bit is not a performance target. Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL being required for x86_64, all x86_64 CPUs can use NOPL. So stop caring about NOPs, simplify things and get on with life. [ The problem seems to be that some uarchs can only decode NOPL on a single front-end port while others have severe decode penalties for excessive prefixes. All modern uarchs can handle both, except Atom, which has prefix penalties. ] [ Also, much doubt you can actually measure any of this on normal workloads. ] After this FEATURE_NOPL is unused except for required-features for x86_64. FEATURE_K8 is only used for PTI and FEATURE_K7 is unused. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf --- Changes v1->v2 (dileks): - Adapt to fit Linux v5.12-rc2-338-gf78d76e72a46 arch/x86/include/asm/jump_label.h | 12 -- arch/x86/include/asm/nops.h | 180 ++++++++++--------------------- arch/x86/include/asm/special_insns.h | 4 arch/x86/kernel/alternative.c | 198 +++-------------------------------- arch/x86/kernel/ftrace.c | 4 arch/x86/kernel/jump_label.c | 32 +---- arch/x86/kernel/kprobes/core.c | 2 arch/x86/kernel/setup.c | 1 arch/x86/kernel/static_call.c | 4 arch/x86/net/bpf_jit_comp.c | 8 - 10 files changed, 98 insertions(+), 347 deletions(-) --- a/arch/x86/include/asm/jump_label.h +++ b/arch/x86/include/asm/jump_label.h @@ -6,12 +6,6 @@ #define JUMP_LABEL_NOP_SIZE 5 -#ifdef CONFIG_X86_64 -# define STATIC_KEY_INIT_NOP P6_NOP5_ATOMIC -#else -# define STATIC_KEY_INIT_NOP GENERIC_NOP5_ATOMIC -#endif - #include <asm/asm.h> #include <asm/nops.h> @@ -23,7 +17,7 @@ static __always_inline bool arch_static_branch(struct static_key *key, bool branch) { asm_volatile_goto("1:" - ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t" + ".byte " __stringify(BYTES_NOP5) "\n\t" ".pushsection __jump_table, \"aw\" \n\t" _ASM_ALIGN "\n\t" ".long 1b - ., %l[l_yes] - . \n\t" @@ -63,7 +57,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool .long \target - .Lstatic_jump_after_\@ .Lstatic_jump_after_\@: .else - .byte STATIC_KEY_INIT_NOP + .byte BYTES_NOP5 .endif .pushsection __jump_table, "aw" _ASM_ALIGN @@ -75,7 +69,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool .macro STATIC_JUMP_IF_FALSE target, key, def .Lstatic_jump_\@: .if \def - .byte STATIC_KEY_INIT_NOP + .byte BYTES_NOP5 .else /* Equivalent to "jmp.d32 \target" */ .byte 0xe9 --- a/arch/x86/include/asm/nops.h +++ b/arch/x86/include/asm/nops.h @@ -4,89 +4,58 @@ /* * Define nops for use with alternative() and for tracing. + */ + +#ifndef CONFIG_64BIT + +/* + * Generic 32bit nops from GAS: + * + * 1: nop + * 2: movl %esi,%esi + * 3: leal 0x0(%esi),%esi + * 4: leal 0x0(%esi,%eiz,1),%esi + * 5: leal %ds:0x0(%esi,%eiz,1),%esi + * 6: leal 0x0(%esi),%esi + * 7: leal 0x0(%esi,%eiz,1),%esi + * 8: leal %ds:0x0(%esi,%eiz,1),%esi * - * *_NOP5_ATOMIC must be a single instruction. + * Except 5 and 8, which are DS prefixed 4 and 7 resp, where GAS would emit 2 + * nop instructions. */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x89,0xf6 +#define BYTES_NOP3 0x8d,0x76,0x00 +#define BYTES_NOP4 0x8d,0x74,0x26,0x00 +#define BYTES_NOP5 0x3e,BYTES_NOP4 +#define BYTES_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 +#define BYTES_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x3e,BYTES_NOP7 + +#else -#define NOP_DS_PREFIX 0x3e +/* + * Generic 64bit nops from GAS: + * + * 1: nop + * 2: osp nop + * 3: nopl (%eax) + * 4: nopl 0x00(%eax) + * 5: nopl 0x00(%eax,%eax,1) + * 6: osp nopl 0x00(%eax,%eax,1) + * 7: nopl 0x00000000(%eax) + * 8: nopl 0x00000000(%eax,%eax,1) + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x66,BYTES_NOP1 +#define BYTES_NOP3 0x0f,0x1f,0x00 +#define BYTES_NOP4 0x0f,0x1f,0x40,0x00 +#define BYTES_NOP5 0x0f,0x1f,0x44,0x00,0x00 +#define BYTES_NOP6 0x66,BYTES_NOP5 +#define BYTES_NOP7 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00 -/* generic versions from gas - 1: nop - the following instructions are NOT nops in 64-bit mode, - for 64-bit mode use K8 or P6 nops instead - 2: movl %esi,%esi - 3: leal 0x00(%esi),%esi - 4: leal 0x00(,%esi,1),%esi - 6: leal 0x00000000(%esi),%esi - 7: leal 0x00000000(,%esi,1),%esi -*/ -#define GENERIC_NOP1 0x90 -#define GENERIC_NOP2 0x89,0xf6 -#define GENERIC_NOP3 0x8d,0x76,0x00 -#define GENERIC_NOP4 0x8d,0x74,0x26,0x00 -#define GENERIC_NOP5 GENERIC_NOP1,GENERIC_NOP4 -#define GENERIC_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 -#define GENERIC_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 -#define GENERIC_NOP8 GENERIC_NOP1,GENERIC_NOP7 -#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4 - -/* Opteron 64bit nops - 1: nop - 2: osp nop - 3: osp osp nop - 4: osp osp osp nop -*/ -#define K8_NOP1 GENERIC_NOP1 -#define K8_NOP2 0x66,K8_NOP1 -#define K8_NOP3 0x66,K8_NOP2 -#define K8_NOP4 0x66,K8_NOP3 -#define K8_NOP5 K8_NOP3,K8_NOP2 -#define K8_NOP6 K8_NOP3,K8_NOP3 -#define K8_NOP7 K8_NOP4,K8_NOP3 -#define K8_NOP8 K8_NOP4,K8_NOP4 -#define K8_NOP5_ATOMIC 0x66,K8_NOP4 - -/* K7 nops - uses eax dependencies (arbitrary choice) - 1: nop - 2: movl %eax,%eax - 3: leal (,%eax,1),%eax - 4: leal 0x00(,%eax,1),%eax - 6: leal 0x00000000(%eax),%eax - 7: leal 0x00000000(,%eax,1),%eax -*/ -#define K7_NOP1 GENERIC_NOP1 -#define K7_NOP2 0x8b,0xc0 -#define K7_NOP3 0x8d,0x04,0x20 -#define K7_NOP4 0x8d,0x44,0x20,0x00 -#define K7_NOP5 K7_NOP4,K7_NOP1 -#define K7_NOP6 0x8d,0x80,0,0,0,0 -#define K7_NOP7 0x8D,0x04,0x05,0,0,0,0 -#define K7_NOP8 K7_NOP7,K7_NOP1 -#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4 - -/* P6 nops - uses eax dependencies (Intel-recommended choice) - 1: nop - 2: osp nop - 3: nopl (%eax) - 4: nopl 0x00(%eax) - 5: nopl 0x00(%eax,%eax,1) - 6: osp nopl 0x00(%eax,%eax,1) - 7: nopl 0x00000000(%eax) - 8: nopl 0x00000000(%eax,%eax,1) - Note: All the above are assumed to be a single instruction. - There is kernel code that depends on this. -*/ -#define P6_NOP1 GENERIC_NOP1 -#define P6_NOP2 0x66,0x90 -#define P6_NOP3 0x0f,0x1f,0x00 -#define P6_NOP4 0x0f,0x1f,0x40,0 -#define P6_NOP5 0x0f,0x1f,0x44,0x00,0 -#define P6_NOP6 0x66,0x0f,0x1f,0x44,0x00,0 -#define P6_NOP7 0x0f,0x1f,0x80,0,0,0,0 -#define P6_NOP8 0x0f,0x1f,0x84,0x00,0,0,0,0 -#define P6_NOP5_ATOMIC P6_NOP5 +#endif /* CONFIG_64BIT */ #ifdef __ASSEMBLY__ #define _ASM_MK_NOP(x) .byte x @@ -94,54 +63,19 @@ #define _ASM_MK_NOP(x) ".byte " __stringify(x) "\n" #endif -#if defined(CONFIG_MK7) -#define ASM_NOP1 _ASM_MK_NOP(K7_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(K7_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(K7_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(K7_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(K7_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(K7_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(K7_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(K7_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K7_NOP5_ATOMIC) -#elif defined(CONFIG_X86_P6_NOP) -#define ASM_NOP1 _ASM_MK_NOP(P6_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(P6_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(P6_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(P6_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(P6_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(P6_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(P6_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(P6_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(P6_NOP5_ATOMIC) -#elif defined(CONFIG_X86_64) -#define ASM_NOP1 _ASM_MK_NOP(K8_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(K8_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(K8_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(K8_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(K8_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(K8_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(K8_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(K8_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K8_NOP5_ATOMIC) -#else -#define ASM_NOP1 _ASM_MK_NOP(GENERIC_NOP1) -#define ASM_NOP2 _ASM_MK_NOP(GENERIC_NOP2) -#define ASM_NOP3 _ASM_MK_NOP(GENERIC_NOP3) -#define ASM_NOP4 _ASM_MK_NOP(GENERIC_NOP4) -#define ASM_NOP5 _ASM_MK_NOP(GENERIC_NOP5) -#define ASM_NOP6 _ASM_MK_NOP(GENERIC_NOP6) -#define ASM_NOP7 _ASM_MK_NOP(GENERIC_NOP7) -#define ASM_NOP8 _ASM_MK_NOP(GENERIC_NOP8) -#define ASM_NOP5_ATOMIC _ASM_MK_NOP(GENERIC_NOP5_ATOMIC) -#endif +#define ASM_NOP1 _ASM_MK_NOP(BYTES_NOP1) +#define ASM_NOP2 _ASM_MK_NOP(BYTES_NOP2) +#define ASM_NOP3 _ASM_MK_NOP(BYTES_NOP3) +#define ASM_NOP4 _ASM_MK_NOP(BYTES_NOP4) +#define ASM_NOP5 _ASM_MK_NOP(BYTES_NOP5) +#define ASM_NOP6 _ASM_MK_NOP(BYTES_NOP6) +#define ASM_NOP7 _ASM_MK_NOP(BYTES_NOP7) +#define ASM_NOP8 _ASM_MK_NOP(BYTES_NOP8) #define ASM_NOP_MAX 8 -#define NOP_ATOMIC5 (ASM_NOP_MAX+1) /* Entry for the 5-byte atomic NOP */ #ifndef __ASSEMBLY__ -extern const unsigned char * const *ideal_nops; -extern void arch_init_ideal_nops(void); +extern const unsigned char * const x86_nops[]; #endif #endif /* _ASM_X86_NOPS_H */ --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -214,7 +214,7 @@ static inline void clflush(volatile void static inline void clflushopt(volatile void *__p) { - alternative_io(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0", + alternative_io(".byte 0x3e; clflush %P0", ".byte 0x66; clflush %P0", X86_FEATURE_CLFLUSHOPT, "+m" (*(volatile char __force *)__p)); @@ -225,7 +225,7 @@ static inline void clwb(volatile void *_ volatile struct { char x[64]; } *p = __p; asm volatile(ALTERNATIVE_2( - ".byte " __stringify(NOP_DS_PREFIX) "; clflush (%[pax])", + ".byte 0x3e; clflush (%[pax])", ".byte 0x66; clflush (%[pax])", /* clflushopt (%%rax) */ X86_FEATURE_CLFLUSHOPT, ".byte 0x66, 0x0f, 0xae, 0x30", /* clwb (%%rax) */ --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -74,186 +74,30 @@ do { \ } \ } while (0) -/* - * Each GENERIC_NOPX is of X bytes, and defined as an array of bytes - * that correspond to that nop. Getting from one nop to the next, we - * add to the array the offset that is equal to the sum of all sizes of - * nops preceding the one we are after. - * - * Note: The GENERIC_NOP5_ATOMIC is at the end, as it breaks the - * nice symmetry of sizes of the previous nops. - */ -#if defined(GENERIC_NOP1) && !defined(CONFIG_X86_64) -static const unsigned char intelnops[] = +const unsigned char x86nops[] = { - GENERIC_NOP1, - GENERIC_NOP2, - GENERIC_NOP3, - GENERIC_NOP4, - GENERIC_NOP5, - GENERIC_NOP6, - GENERIC_NOP7, - GENERIC_NOP8, - GENERIC_NOP5_ATOMIC -}; -static const unsigned char * const intel_nops[ASM_NOP_MAX+2] = -{ - NULL, - intelnops, - intelnops + 1, - intelnops + 1 + 2, - intelnops + 1 + 2 + 3, - intelnops + 1 + 2 + 3 + 4, - intelnops + 1 + 2 + 3 + 4 + 5, - intelnops + 1 + 2 + 3 + 4 + 5 + 6, - intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, + BYTES_NOP1, + BYTES_NOP2, + BYTES_NOP3, + BYTES_NOP4, + BYTES_NOP5, + BYTES_NOP6, + BYTES_NOP7, + BYTES_NOP8, }; -#endif -#ifdef K8_NOP1 -static const unsigned char k8nops[] = -{ - K8_NOP1, - K8_NOP2, - K8_NOP3, - K8_NOP4, - K8_NOP5, - K8_NOP6, - K8_NOP7, - K8_NOP8, - K8_NOP5_ATOMIC -}; -static const unsigned char * const k8_nops[ASM_NOP_MAX+2] = -{ - NULL, - k8nops, - k8nops + 1, - k8nops + 1 + 2, - k8nops + 1 + 2 + 3, - k8nops + 1 + 2 + 3 + 4, - k8nops + 1 + 2 + 3 + 4 + 5, - k8nops + 1 + 2 + 3 + 4 + 5 + 6, - k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, -}; -#endif - -#if defined(K7_NOP1) && !defined(CONFIG_X86_64) -static const unsigned char k7nops[] = -{ - K7_NOP1, - K7_NOP2, - K7_NOP3, - K7_NOP4, - K7_NOP5, - K7_NOP6, - K7_NOP7, - K7_NOP8, - K7_NOP5_ATOMIC -}; -static const unsigned char * const k7_nops[ASM_NOP_MAX+2] = -{ - NULL, - k7nops, - k7nops + 1, - k7nops + 1 + 2, - k7nops + 1 + 2 + 3, - k7nops + 1 + 2 + 3 + 4, - k7nops + 1 + 2 + 3 + 4 + 5, - k7nops + 1 + 2 + 3 + 4 + 5 + 6, - k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, -}; -#endif - -#ifdef P6_NOP1 -static const unsigned char p6nops[] = -{ - P6_NOP1, - P6_NOP2, - P6_NOP3, - P6_NOP4, - P6_NOP5, - P6_NOP6, - P6_NOP7, - P6_NOP8, - P6_NOP5_ATOMIC -}; -static const unsigned char * const p6_nops[ASM_NOP_MAX+2] = +const unsigned char * const x86_nops[ASM_NOP_MAX+1] = { NULL, - p6nops, - p6nops + 1, - p6nops + 1 + 2, - p6nops + 1 + 2 + 3, - p6nops + 1 + 2 + 3 + 4, - p6nops + 1 + 2 + 3 + 4 + 5, - p6nops + 1 + 2 + 3 + 4 + 5 + 6, - p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, - p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8, + x86nops, + x86nops + 1, + x86nops + 1 + 2, + x86nops + 1 + 2 + 3, + x86nops + 1 + 2 + 3 + 4, + x86nops + 1 + 2 + 3 + 4 + 5, + x86nops + 1 + 2 + 3 + 4 + 5 + 6, + x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7, }; -#endif - -/* Initialize these to a safe default */ -#ifdef CONFIG_X86_64 -const unsigned char * const *ideal_nops = p6_nops; -#else -const unsigned char * const *ideal_nops = intel_nops; -#endif - -void __init arch_init_ideal_nops(void) -{ - switch (boot_cpu_data.x86_vendor) { - case X86_VENDOR_INTEL: - /* - * Due to a decoder implementation quirk, some - * specific Intel CPUs actually perform better with - * the "k8_nops" than with the SDM-recommended NOPs. - */ - if (boot_cpu_data.x86 == 6 && - boot_cpu_data.x86_model >= 0x0f && - boot_cpu_data.x86_model != 0x1c && - boot_cpu_data.x86_model != 0x26 && - boot_cpu_data.x86_model != 0x27 && - boot_cpu_data.x86_model < 0x30) { - ideal_nops = k8_nops; - } else if (boot_cpu_has(X86_FEATURE_NOPL)) { - ideal_nops = p6_nops; - } else { -#ifdef CONFIG_X86_64 - ideal_nops = k8_nops; -#else - ideal_nops = intel_nops; -#endif - } - break; - - case X86_VENDOR_HYGON: - ideal_nops = p6_nops; - return; - - case X86_VENDOR_AMD: - if (boot_cpu_data.x86 > 0xf) { - ideal_nops = p6_nops; - return; - } - - fallthrough; - - default: -#ifdef CONFIG_X86_64 - ideal_nops = k8_nops; -#else - if (boot_cpu_has(X86_FEATURE_K8)) - ideal_nops = k8_nops; - else if (boot_cpu_has(X86_FEATURE_K7)) - ideal_nops = k7_nops; - else - ideal_nops = intel_nops; -#endif - } -} /* Use this to add nops to a buffer, then text_poke the whole buffer. */ static void __init_or_module add_nops(void *insns, unsigned int len) @@ -262,7 +106,7 @@ static void __init_or_module add_nops(vo unsigned int noplen = len; if (noplen > ASM_NOP_MAX) noplen = ASM_NOP_MAX; - memcpy(insns, ideal_nops[noplen], noplen); + memcpy(insns, x86_nops[noplen], noplen); insns += noplen; len -= noplen; } @@ -1302,13 +1146,13 @@ static void text_poke_loc_init(struct te default: /* assume NOP */ switch (len) { case 2: /* NOP2 -- emulate as JMP8+0 */ - BUG_ON(memcmp(emulate, ideal_nops[len], len)); + BUG_ON(memcmp(emulate, x86_nops[len], len)); tp->opcode = JMP8_INSN_OPCODE; tp->rel32 = 0; break; case 5: /* NOP5 -- emulate as JMP32+0 */ - BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], len)); + BUG_ON(memcmp(emulate, x86_nops[len], len)); tp->opcode = JMP32_INSN_OPCODE; tp->rel32 = 0; break; --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -66,7 +66,7 @@ int ftrace_arch_code_modify_post_process static const char *ftrace_nop_replace(void) { - return ideal_nops[NOP_ATOMIC5]; + return x86_nops[5]; } static const char *ftrace_call_replace(unsigned long ip, unsigned long addr) @@ -377,7 +377,7 @@ create_trampoline(struct ftrace_ops *ops ip = trampoline + (jmp_offset - start_offset); if (WARN_ON(*(char *)ip != 0x75)) goto fail; - ret = copy_from_kernel_nofault(ip, ideal_nops[2], 2); + ret = copy_from_kernel_nofault(ip, x86_nops[2], 2); if (ret < 0) goto fail; } --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -28,10 +28,8 @@ static void bug_at(const void *ip, int l } static const void * -__jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type type, int init) +__jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type type) { - const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP }; - const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5]; const void *expect, *code; const void *addr, *dest; int line; @@ -41,10 +39,8 @@ __jump_label_set_jump_code(struct jump_e code = text_gen_insn(JMP32_INSN_OPCODE, addr, dest); - if (init) { - expect = default_nop; line = __LINE__; - } else if (type == JUMP_LABEL_JMP) { - expect = ideal_nop; line = __LINE__; + if (type == JUMP_LABEL_JMP) { + expect = x86_nops[5]; line = __LINE__; } else { expect = code; line = __LINE__; } @@ -53,7 +49,7 @@ __jump_label_set_jump_code(struct jump_e bug_at(addr, line); if (type == JUMP_LABEL_NOP) - code = ideal_nop; + code = x86_nops[5]; return code; } @@ -62,7 +58,7 @@ static inline void __jump_label_transfor enum jump_label_type type, int init) { - const void *opcode = __jump_label_set_jump_code(entry, type, init); + const void *opcode = __jump_label_set_jump_code(entry, type); /* * As long as only a single processor is running and the code is still @@ -113,7 +109,7 @@ bool arch_jump_label_transform_queue(str } mutex_lock(&text_mutex); - opcode = __jump_label_set_jump_code(entry, type, 0); + opcode = __jump_label_set_jump_code(entry, type); text_poke_queue((void *)jump_entry_code(entry), opcode, JUMP_LABEL_NOP_SIZE, NULL); mutex_unlock(&text_mutex); @@ -136,22 +132,6 @@ static enum { __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type) { - /* - * This function is called at boot up and when modules are - * first loaded. Check if the default nop, the one that is - * inserted at compile time, is the ideal nop. If it is, then - * we do not need to update the nop, and we can leave it as is. - * If it is not, then we need to update the nop to the ideal nop. - */ - if (jlstate == JL_STATE_START) { - const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP }; - const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5]; - - if (memcmp(ideal_nop, default_nop, 5) != 0) - jlstate = JL_STATE_UPDATE; - else - jlstate = JL_STATE_NO_UPDATE; - } if (jlstate == JL_STATE_UPDATE) jump_label_transform(entry, type, 1); } --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -229,7 +229,7 @@ __recover_probed_insn(kprobe_opcode_t *b return 0UL; if (faddr) - memcpy(buf, ideal_nops[NOP_ATOMIC5], 5); + memcpy(buf, x86_nops[5], 5); else buf[0] = kp->opcode; return (unsigned long)buf; --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -822,7 +822,6 @@ void __init setup_arch(char **cmdline_p) idt_setup_early_traps(); early_cpu_init(); - arch_init_ideal_nops(); jump_label_init(); static_call_init(); early_ioremap_init(); --- a/arch/x86/kernel/static_call.c +++ b/arch/x86/kernel/static_call.c @@ -34,7 +34,7 @@ static void __ref __static_call_transfor break; case NOP: - code = ideal_nops[NOP_ATOMIC5]; + code = x86_nops[5]; break; case JMP: @@ -66,7 +66,7 @@ static void __static_call_validate(void return; } else { if (opcode == CALL_INSN_OPCODE || - !memcmp(insn, ideal_nops[NOP_ATOMIC5], 5) || + !memcmp(insn, x86_nops[5], 5) || !memcmp(insn, xor5rax, 5)) return; } --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -282,7 +282,7 @@ static void emit_prologue(u8 **pprog, u3 /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ - memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt); + memcpy(prog, x86_nops[5], cnt); prog += cnt; if (!ebpf_from_cbpf) { if (tail_call_reachable && !is_subprog) @@ -330,7 +330,7 @@ static int __bpf_arch_text_poke(void *ip void *old_addr, void *new_addr, const bool text_live) { - const u8 *nop_insn = ideal_nops[NOP_ATOMIC5]; + const u8 *nop_insn = x86_nops[5]; u8 old_insn[X86_PATCH_SIZE]; u8 new_insn[X86_PATCH_SIZE]; u8 *prog; @@ -560,7 +560,7 @@ static void emit_bpf_tail_call_direct(st if (stack_depth) EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); - memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE); + memcpy(prog, x86_nops[5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; /* out: */ @@ -881,7 +881,7 @@ static int emit_nops(u8 **pprog, int len noplen = ASM_NOP_MAX; for (i = 0; i < noplen; i++) - EMIT1(ideal_nops[noplen][i]); + EMIT1(x86_nops[noplen][i]); len -= noplen; } From MAILER-DAEMON Fri Mar 12 17:46:39 2021 Message-ID: <20210312115749.136357911@infradead.org> Date: Fri, 12 Mar 2021 12:32:55 +0100 From: Peter Zijlstra <peterz@infradead.org> To: x86@kernel.org, rostedt@goodmis.org, hpa@zytor.com, torvalds@linuxfoundation.org Cc: linux-kernel@vger.kernel.org, linux-toolchains@vger.kernel.org, peterz@infradead.org, jpoimboe@redhat.com, alexei.starovoitov@gmail.com, mhiramat@kernel.org Subject: [PATCH 2/2] objtool,x86: Use asm/nops.h References: <20210312113253.305040674@infradead.org> List-ID: <linux-toolchains.vger.kernel.org> X-Mailing-List: linux-toolchains@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Since the kernel will rely on a single canonical set of NOPs, make sure objtool uses the exact same ones. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- tools/arch/x86/include/asm/nops.h | 81 ++++++++++++++++++++++++++++++++++++++ tools/objtool/arch/x86/decode.c | 13 +++--- tools/objtool/sync-check.sh | 1 3 files changed, 90 insertions(+), 5 deletions(-) --- /dev/null +++ b/tools/arch/x86/include/asm/nops.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_NOPS_H +#define _ASM_X86_NOPS_H + +/* + * Define nops for use with alternative() and for tracing. + */ + +#ifndef CONFIG_64BIT + +/* + * Generic 32bit nops from GAS: + * + * 1: nop + * 2: movl %esi,%esi + * 3: leal 0x0(%esi),%esi + * 4: leal 0x0(%esi,%eiz,1),%esi + * 5: leal %ds:0x0(%esi,%eiz,1),%esi + * 6: leal 0x0(%esi),%esi + * 7: leal 0x0(%esi,%eiz,1),%esi + * 8: leal %ds:0x0(%esi,%eiz,1),%esi + * + * Except 5 and 8, which are DS prefixed 4 and 7 resp, where GAS would emit 2 + * nop instructions. + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x89,0xf6 +#define BYTES_NOP3 0x8d,0x76,0x00 +#define BYTES_NOP4 0x8d,0x74,0x26,0x00 +#define BYTES_NOP5 0x3e,BYTES_NOP4 +#define BYTES_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00 +#define BYTES_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x3e,BYTES_NOP7 + +#else + +/* + * Generic 64bit nops from GAS: + * + * 1: nop + * 2: osp nop + * 3: nopl (%eax) + * 4: nopl 0x00(%eax) + * 5: nopl 0x00(%eax,%eax,1) + * 6: osp nopl 0x00(%eax,%eax,1) + * 7: nopl 0x00000000(%eax) + * 8: nopl 0x00000000(%eax,%eax,1) + */ +#define BYTES_NOP1 0x90 +#define BYTES_NOP2 0x66,BYTES_NOP1 +#define BYTES_NOP3 0x0f,0x1f,0x00 +#define BYTES_NOP4 0x0f,0x1f,0x40,0x00 +#define BYTES_NOP5 0x0f,0x1f,0x44,0x00,0x00 +#define BYTES_NOP6 0x66,BYTES_NOP5 +#define BYTES_NOP7 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00 +#define BYTES_NOP8 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00 + +#endif /* CONFIG_64BIT */ + +#ifdef __ASSEMBLY__ +#define _ASM_MK_NOP(x) .byte x +#else +#define _ASM_MK_NOP(x) ".byte " __stringify(x) "\n" +#endif + +#define ASM_NOP1 _ASM_MK_NOP(BYTES_NOP1) +#define ASM_NOP2 _ASM_MK_NOP(BYTES_NOP2) +#define ASM_NOP3 _ASM_MK_NOP(BYTES_NOP3) +#define ASM_NOP4 _ASM_MK_NOP(BYTES_NOP4) +#define ASM_NOP5 _ASM_MK_NOP(BYTES_NOP5) +#define ASM_NOP6 _ASM_MK_NOP(BYTES_NOP6) +#define ASM_NOP7 _ASM_MK_NOP(BYTES_NOP7) +#define ASM_NOP8 _ASM_MK_NOP(BYTES_NOP8) + +#define ASM_NOP_MAX 8 + +#ifndef __ASSEMBLY__ +extern const unsigned char * const x86_nops[]; +#endif + +#endif /* _ASM_X86_NOPS_H */ --- a/tools/objtool/arch/x86/decode.c +++ b/tools/objtool/arch/x86/decode.c @@ -11,6 +11,9 @@ #include "../../../arch/x86/lib/inat.c" #include "../../../arch/x86/lib/insn.c" +#define CONFIG_64BIT 1 +#include <asm/nops.h> + #include <asm/orc_types.h> #include <objtool/check.h> #include <objtool/elf.h> @@ -640,11 +643,11 @@ void arch_initial_func_cfi_state(struct const char *arch_nop_insn(int len) { static const char nops[5][5] = { - /* 1 */ { 0x90 }, - /* 2 */ { 0x66, 0x90 }, - /* 3 */ { 0x0f, 0x1f, 0x00 }, - /* 4 */ { 0x0f, 0x1f, 0x40, 0x00 }, - /* 5 */ { 0x0f, 0x1f, 0x44, 0x00, 0x00 }, + { BYTES_NOP1 }, + { BYTES_NOP2 }, + { BYTES_NOP3 }, + { BYTES_NOP4 }, + { BYTES_NOP5 }, }; if (len < 1 || len > 5) { --- a/tools/objtool/sync-check.sh +++ b/tools/objtool/sync-check.sh @@ -10,6 +10,7 @@ FILES="include/linux/objtool.h" if [ "$SRCARCH" = "x86" ]; then FILES="$FILES +arch/x86/include/asm/nops.h arch/x86/include/asm/inat_types.h arch/x86/include/asm/orc_types.h arch/x86/include/asm/emulate_prefix.h ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 18:13 ` Sedat Dilek @ 2021-03-12 19:03 ` Sedat Dilek 0 siblings, 0 replies; 51+ messages in thread From: Sedat Dilek @ 2021-03-12 19:03 UTC (permalink / raw) To: Steven Rostedt Cc: Borislav Petkov, Peter Zijlstra, x86, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 7:13 PM Sedat Dilek <sedat.dilek@gmail.com> wrote: > > On Fri, Mar 12, 2021 at 6:47 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Fri, 12 Mar 2021 18:35:45 +0100 > > Sedat Dilek <sedat.dilek@gmail.com> wrote: > > > > > > > Hey Steve, you degraded me to a number :-). > > > > It's the internet, everyone is a number. > > > > > > > > I dunno which Git tree this patchset applies to, but I check if I can > > > apply the patchset to my current local Git. > > > > Try Linus's latest. > > > > $ git describe origin/HEAD > v5.12-rc2-338-gf78d76e72a46 > > I adapted 1/2 in arch/x86/include/asm/jump_label.h to fit ^^^, see attachment. > Forget this. With latest Linus Git you need to apply "x86/jump_label: Mark arguments as const to satisfy asm constraints" from tip Git. - Sedat - [1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=864b435514b286c0be2a38a02f487aa28d990ef8 > > > > Then build a kernel in the same build-environment. > > > Lemme see. > > > > > > To say with Linus's words: > > > "Numbers talk - bullshit walks." > > > > Exactly. > > > > -- Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-12 11:32 [PATCH 0/2] x86: Remove ideal_nops[] Peter Zijlstra ` (2 preceding siblings ...) 2021-03-12 14:29 ` [PATCH 0/2] x86: Remove ideal_nops[] Sedat Dilek @ 2021-03-12 20:59 ` Borislav Petkov [not found] ` <CA+icZUWSCS6vAQOXoG6nsW+Dbnogivzf+rmegCTMjz5hjE5cKQ@mail.gmail.com> 3 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-12 20:59 UTC (permalink / raw) To: Peter Zijlstra Cc: x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Fri, Mar 12, 2021 at 12:32:53PM +0100, Peter Zijlstra wrote: > Since ultimate performance of a 10 year old chip (Intel Sandy Bridge, 2011) is > simply irrelevant today, remove variable NOPs and use NOPL. Just ran them on my SNB box: cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz stepping : 7 with the usual perf stat kernel build workload with CONFIG_DYNAMIC_FTRACE and CONFIG_FUNCTION_TRACER where each function has a NOP at its beginning when ftrace is disabled (thx Steve). ./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j9 bzImage before: tip-master Performance counter stats for 'make -s -j9 bzImage' (5 runs): 3,213,728.10 msec task-clock # 7.307 CPUs utilized ( +- 0.01% ) 339,270 context-switches # 0.106 K/sec ( +- 0.09% ) 31,472 cpu-migrations # 0.010 K/sec ( +- 0.64% ) 62,070,684 page-faults # 0.019 M/sec ( +- 0.01% ) 11,498,198,009,323 cycles # 3.578 GHz ( +- 0.01% ) (83.33%) 8,235,957,366,696 stalled-cycles-frontend # 71.63% frontend cycles idle ( +- 0.01% ) (83.33%) 5,976,456,688,814 stalled-cycles-backend # 51.98% backend cycles idle ( +- 0.02% ) (66.67%) 7,553,156,344,376 instructions # 0.66 insn per cycle # 1.09 stalled cycles per insn ( +- 0.00% ) (83.33%) 1,635,468,917,524 branches # 508.901 M/sec ( +- 0.00% ) (83.34%) 51,888,292,932 branch-misses # 3.17% of all branches ( +- 0.02% ) (83.33%) 439.809 +- 0.156 seconds time elapsed ( +- 0.04% ) after: tip-master-nops Performance counter stats for 'make -s -j9 bzImage' (5 runs): 3,217,113.67 msec task-clock # 7.307 CPUs utilized ( +- 0.03% ) 339,425 context-switches # 0.106 K/sec ( +- 0.20% ) 31,724 cpu-migrations # 0.010 K/sec ( +- 0.54% ) 62,027,130 page-faults # 0.019 M/sec ( +- 0.01% ) 11,508,779,965,901 cycles # 3.577 GHz ( +- 0.03% ) (83.34%) 8,241,212,210,440 stalled-cycles-frontend # 71.61% frontend cycles idle ( +- 0.04% ) (83.33%) 5,982,615,533,177 stalled-cycles-backend # 51.98% backend cycles idle ( +- 0.06% ) (66.66%) 7,546,407,430,314 instructions # 0.66 insn per cycle # 1.09 stalled cycles per insn ( +- 0.00% ) (83.33%) 1,634,187,006,479 branches # 507.967 M/sec ( +- 0.00% ) (83.33%) 51,941,580,371 branch-misses # 3.18% of all branches ( +- 0.01% ) (83.33%) 440.266 +- 0.195 seconds time elapsed ( +- 0.04% ) So here's numbers talk, bullshit walks. And with those numbers no bullshit can remain lingering around anyway. Cheers! -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <CA+icZUWSCS6vAQOXoG6nsW+Dbnogivzf+rmegCTMjz5hjE5cKQ@mail.gmail.com>]
* Re: [PATCH 0/2] x86: Remove ideal_nops[] [not found] ` <CA+icZUWSCS6vAQOXoG6nsW+Dbnogivzf+rmegCTMjz5hjE5cKQ@mail.gmail.com> @ 2021-03-13 8:49 ` Borislav Petkov 2021-03-13 11:23 ` Borislav Petkov 2021-03-13 12:10 ` Sedat Dilek 0 siblings, 2 replies; 51+ messages in thread From: Borislav Petkov @ 2021-03-13 8:49 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 06:26:15AM +0100, Sedat Dilek wrote: > x86/jump_label: Mark arguments as const to satisfy asm constraints Where do I find this patch? > x86: Remove dynamic NOP selection > objtool,x86: Use asm/nops.h > > My benchmark was to build a Linux-kernel with LLVM/Clang v12.0.0-rc3 > on Debian/testing AMD64. > > Patchset applied for a first build: > > Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 > PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-7-amd64-clang12-cfi > KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza There's a reason I have -s for silent in the build - printing output during the build creates a *lot* of variance. And you have excessive printing with V=1 and KBUILD_VERBOSE=1. Also, you need to repeat those workloads a couple of times - one is not enough. That's why I have --repeat 5 in there. Also, you need --pre=/root/bin/pre-build-kernel.sh where that script is: --- #!/bin/bash echo $0 make -s clean echo 3 > /proc/sys/vm/drop_caches --- so that you can avoid pagecache influence. Lemme rerun here with clang. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 8:49 ` Borislav Petkov @ 2021-03-13 11:23 ` Borislav Petkov 2021-03-13 12:10 ` Sedat Dilek 1 sibling, 0 replies; 51+ messages in thread From: Borislav Petkov @ 2021-03-13 11:23 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 09:49:23AM +0100, Borislav Petkov wrote: > Lemme rerun here with clang. clang11 is almost twice as slow as gcc but difference is still negligible: ~0.6 seconds. ./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j9 LLVM=1 LLVM_IAS=1 bzImage before: Performance counter stats for 'make -s -j9 LLVM=1 LLVM_IAS=1 bzImage' (5 runs): 5,576,081.48 msec task-clock # 7.664 CPUs utilized ( +- 0.03% ) 496,841 context-switches # 0.089 K/sec ( +- 0.11% ) 30,245 cpu-migrations # 0.005 K/sec ( +- 0.53% ) 49,702,714 page-faults # 0.009 M/sec ( +- 0.00% ) 19,954,704,926,347 cycles # 3.579 GHz ( +- 0.02% ) (83.33%) 15,920,125,996,460 stalled-cycles-frontend # 79.78% frontend cycles idle ( +- 0.03% ) (83.33%) 13,177,812,137,935 stalled-cycles-backend # 66.04% backend cycles idle ( +- 0.04% ) (66.67%) 8,778,060,061,848 instructions # 0.44 insn per cycle # 1.81 stalled cycles per insn ( +- 0.00% ) (83.33%) 1,852,121,066,032 branches # 332.155 M/sec ( +- 0.00% ) (83.33%) 84,048,262,434 branch-misses # 4.54% of all branches ( +- 0.02% ) (83.33%) 727.572 +- 0.305 seconds time elapsed ( +- 0.04% ) after: Performance counter stats for 'make -s -j9 LLVM=1 LLVM_IAS=1 bzImage' (5 runs): 5,581,654.38 msec task-clock # 7.665 CPUs utilized ( +- 0.01% ) 496,274 context-switches # 0.089 K/sec ( +- 0.12% ) 30,645 cpu-migrations # 0.005 K/sec ( +- 0.54% ) 49,711,551 page-faults # 0.009 M/sec ( +- 0.01% ) 19,968,933,753,686 cycles # 3.578 GHz ( +- 0.01% ) (83.33%) 15,925,776,797,854 stalled-cycles-frontend # 79.75% frontend cycles idle ( +- 0.01% ) (83.33%) 13,182,158,323,446 stalled-cycles-backend # 66.01% backend cycles idle ( +- 0.01% ) (66.67%) 8,778,619,885,119 instructions # 0.44 insn per cycle # 1.81 stalled cycles per insn ( +- 0.00% ) (83.33%) 1,852,096,100,464 branches # 331.818 M/sec ( +- 0.01% ) (83.33%) 84,264,257,355 branch-misses # 4.55% of all branches ( +- 0.03% ) (83.33%) 728.2400 +- 0.0613 seconds time elapsed ( +- 0.01% ) -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 8:49 ` Borislav Petkov 2021-03-13 11:23 ` Borislav Petkov @ 2021-03-13 12:10 ` Sedat Dilek 2021-03-13 12:15 ` Borislav Petkov 1 sibling, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-13 12:10 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 9:51 AM Borislav Petkov <bp@alien8.de> wrote: > > On Sat, Mar 13, 2021 at 06:26:15AM +0100, Sedat Dilek wrote: > > x86/jump_label: Mark arguments as const to satisfy asm constraints > > Where do I find this patch? > Here we go: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=864b435514b286c0be2a38a02f487aa28d990ef8 > > x86: Remove dynamic NOP selection > > objtool,x86: Use asm/nops.h > > > > My benchmark was to build a Linux-kernel with LLVM/Clang v12.0.0-rc3 > > on Debian/testing AMD64. > > > > Patchset applied for a first build: > > > > Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 > > PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-7-amd64-clang12-cfi > > KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza > > There's a reason I have -s for silent in the build - printing output > during the build creates a *lot* of variance. And you have excessive > printing with V=1 and KBUILD_VERBOSE=1. > > Also, you need to repeat those workloads a couple of times - one is not > enough. That's why I have --repeat 5 in there. > > Also, you need --pre=/root/bin/pre-build-kernel.sh where that script is: > > --- > #!/bin/bash > echo $0 > > make -s clean > echo 3 > /proc/sys/vm/drop_caches > --- > > so that you can avoid pagecache influence. > OK, I see. - Sedat - > Lemme rerun here with clang. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 12:10 ` Sedat Dilek @ 2021-03-13 12:15 ` Borislav Petkov 2021-03-13 12:38 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-13 12:15 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 01:10:29PM +0100, Sedat Dilek wrote: > Here we go: > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=864b435514b286c0be2a38a02f487aa28d990ef8 That's why I told earlier you to use tip/master - that patch is already in it and all you would've needed to do is to apply the two nop patches. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 12:15 ` Borislav Petkov @ 2021-03-13 12:38 ` Sedat Dilek 2021-03-13 12:49 ` Borislav Petkov 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-13 12:38 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 1:15 PM Borislav Petkov <bp@alien8.de> wrote: > > On Sat, Mar 13, 2021 at 01:10:29PM +0100, Sedat Dilek wrote: > > Here we go: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/patch/?id=864b435514b286c0be2a38a02f487aa28d990ef8 > > That's why I told earlier you to use tip/master - that patch is already > in it and all you would've needed to do is to apply the two nop patches. > Thanks for all your testings and suggestions. For me it was easier to apply these 3 patches on top of my custom patchset to see what impact Peter's patchset. AFAICS you did a 5 times x86-64 defconfig with dropped pagecache and `make -j9`? I run my "normal" workflow(s) (and build-script) for easier comparison on my side. Big thank-you for testing with LLVM/Clang v11.x - twice as slow as with GCC :-(. A selfmade ThinLTO+PGO optimized LLVM tooolchain v11.x/v12-rcX/v13-git is here as fast as Debian's GCC-v10.2.1 to build a Linux-kernel - approx. 03:30 [hh:mm] - full adapted Debian v5.10.y kernel-config. Does your distribution offer LLVM/Clang v12.0.0-rc3 (released this week) binaries? - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 12:38 ` Sedat Dilek @ 2021-03-13 12:49 ` Borislav Petkov 2021-03-13 12:58 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-13 12:49 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 01:38:22PM +0100, Sedat Dilek wrote: > AFAICS you did a 5 times x86-64 defconfig with dropped pagecache and `make -j9`? The tailored .config for that particular test box. > Does your distribution offer LLVM/Clang v12.0.0-rc3 (released this > week) binaries? The partition on that box I used is debian testing, so no: $ apt search llvm-1* 2>/dev/null | grep llvm-1 libllvm-11-ocaml-dev/testing,testing 1:11.0.1-2 amd64 llvm-10/now 1:10.0.1-8+b1 amd64 [installed,local] llvm-10-dev/now 1:10.0.1-8+b1 amd64 [installed,local] llvm-10-runtime/now 1:10.0.1-8+b1 amd64 [installed,local] llvm-10-tools/now 1:10.0.1-8+b1 amd64 [installed,local] llvm-11/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] llvm-11-dev/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] llvm-11-doc/testing,testing 1:11.0.1-2 all llvm-11-examples/testing,testing 1:11.0.1-2 all llvm-11-runtime/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] llvm-11-tools/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 12:49 ` Borislav Petkov @ 2021-03-13 12:58 ` Sedat Dilek 2021-03-13 13:29 ` Borislav Petkov 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-13 12:58 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 1:49 PM Borislav Petkov <bp@alien8.de> wrote: > > On Sat, Mar 13, 2021 at 01:38:22PM +0100, Sedat Dilek wrote: > > AFAICS you did a 5 times x86-64 defconfig with dropped pagecache and `make -j9`? > > The tailored .config for that particular test box. > > > Does your distribution offer LLVM/Clang v12.0.0-rc3 (released this > > week) binaries? > > The partition on that box I used is debian testing, so no: > > $ apt search llvm-1* 2>/dev/null | grep llvm-1 > libllvm-11-ocaml-dev/testing,testing 1:11.0.1-2 amd64 > llvm-10/now 1:10.0.1-8+b1 amd64 [installed,local] > llvm-10-dev/now 1:10.0.1-8+b1 amd64 [installed,local] > llvm-10-runtime/now 1:10.0.1-8+b1 amd64 [installed,local] > llvm-10-tools/now 1:10.0.1-8+b1 amd64 [installed,local] > llvm-11/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] > llvm-11-dev/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] > llvm-11-doc/testing,testing 1:11.0.1-2 all > llvm-11-examples/testing,testing 1:11.0.1-2 all > llvm-11-runtime/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] > llvm-11-tools/testing,testing,now 1:11.0.1-2 amd64 [installed,automatic] > You can add Debian/experimental APT sources.list ... [ /etc/apt/sources.list.d/debian-experimental.list ] deb http://ftp.debian.org/debian experimental main contrib non-free deb https://deb.debian.org/debian experimental main non-free contrib [ /etc/apt/preferences.d/99_debian-experimental.pref ] Package: * Pin: release o=Debian,a=experimental Pin-Priority: 99 This gives LLVM/Clang v12 packages an APT prio of 99 - meaning no auto-upgrade installations will be done. You have full control by doing it manually. Renew informations from APT repositories: root# apt-get update What clang-12 version is/are available? root# apt-cache policy clang-12 Simulate an install (note: --no-install-recommends option): root# apt-get install llvm-12 clang-12 lld-12 llvm-12-tools --no-install-recommends -t experimental -s option -s: simulate Really do an installation of LLVM/Clang v12 stuff: root# apt-get install llvm-12 clang-12 lld-12 llvm-12-tools --no-install-recommends -t experimental -y option -y: yes If you like to test. Of course you can use packages from <apt-llvm.org> repositories. I can give you APT sources.list plus pref files if you desire. Have more fun. - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 12:58 ` Sedat Dilek @ 2021-03-13 13:29 ` Borislav Petkov 2021-03-13 13:47 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-13 13:29 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 01:58:56PM +0100, Sedat Dilek wrote: > You can add Debian/experimental APT sources.list ... I could but I don't expect clang12 to behave any differently here. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 13:29 ` Borislav Petkov @ 2021-03-13 13:47 ` Sedat Dilek 2021-03-15 17:04 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-13 13:47 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Sat, Mar 13, 2021 at 2:29 PM Borislav Petkov <bp@alien8.de> wrote: > > On Sat, Mar 13, 2021 at 01:58:56PM +0100, Sedat Dilek wrote: > > You can add Debian/experimental APT sources.list ... > > I could but I don't expect clang12 to behave any differently here. > Agreed in things of build-time. There were some improvements and optimizations to LLVM/Clang but twice as slow is really hard compared with GCC. I was thinking more in the direction of "compatibility" of tip tree with recent LLVM/Clang other than what is officially supported via Kbuild-system. Let me look if I will do a selfmade ThinLTO+PGO optimized LLVM toolchain v12.0.0-rc3 this weekend. - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-13 13:47 ` Sedat Dilek @ 2021-03-15 17:04 ` Sedat Dilek 2021-03-15 17:15 ` Borislav Petkov 2021-03-15 18:10 ` Peter Zijlstra 0 siblings, 2 replies; 51+ messages in thread From: Sedat Dilek @ 2021-03-15 17:04 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat [-- Attachment #1: Type: text/plain, Size: 4618 bytes --] On Sat, Mar 13, 2021 at 2:47 PM Sedat Dilek <sedat.dilek@gmail.com> wrote: [ ... ] > Let me look if I will do a selfmade ThinLTO+PGO optimized LLVM > toolchain v12.0.0-rc3 this weekend. > I did it. Here some fresh numbers: [ Selfmade LLVM toolchain v12.0.0-rc3 "stage1-only" ] [ Host-Kernel: 5.12.0-rc2-8-amd64-clang12-cfi includes Peter's NOPS patchset ] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-9-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-13 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc2-9~bullseye+dileks1': 55936351.95 msec task-clock # 3.580 CPUs utilized 8291848 context-switches # 0.148 K/sec 269686 cpu-migrations # 0.005 K/sec 288389721 page-faults # 0.005 M/sec 108344049253836 cycles # 1.937 GHz 83228135285263 stalled-cycles-frontend # 76.82% frontend cycles idle 65616255370809 stalled-cycles-backend # 60.56% backend cycles idle 59590373937199 instructions # 0.55 insn per cycle # 1.40 stalled cycles per insn 10906265495505 branches # 194.976 M/sec 488578274434 branch-misses # 4.48% of all branches 15622.926203302 seconds time elapsed 53453.974928000 seconds user 2526.773533000 seconds sys [ Selfmade LLVM toolchain v12.0.0-rc3 "thinlto_pgo_optimized" ] [ Host-Kernel: Debian's 5.10.19-1 kernel ] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-10-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-14 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc2-10~bullseye+dileks1': 40223080.69 msec task-clock # 3.434 CPUs utilized 7438923 context-switches # 0.185 K/sec 245636 cpu-migrations # 0.006 K/sec 288073015 page-faults # 0.007 M/sec 77325441657129 cycles # 1.922 GHz 55357463522675 stalled-cycles-frontend # 71.59% frontend cycles idle 38978871249074 stalled-cycles-backend # 50.41% backend cycles idle 55178265045056 instructions # 0.71 insn per cycle # 1.00 stalled cycles per insn 9749166033571 branches # 242.377 M/sec 431303563167 branch-misses # 4.42% of all branches 11714.751645982 seconds time elapsed 37951.117840000 seconds user 2313.807151000 seconds sys [ Selfmade LLVM toolchain v12.0.0-rc3 "thinlto_pgo_optimized" ] [ Host-Kernel: 5.12.0-rc2-10-amd64-clang12-cfi includes Peter's NOPS patchset ] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-1-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-15 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc3-1~bullseye+dileks1': 40632207.25 msec task-clock # 3.406 CPUs utilized 8216832 context-switches # 0.202 K/sec 277610 cpu-migrations # 0.007 K/sec 281331052 page-faults # 0.007 M/sec 77031538570411 cycles # 1.896 GHz (83.33%) 55247905369487 stalled-cycles-frontend # 71.72% frontend cycles idle (83.33%) 39046795510242 stalled-cycles-backend # 50.69% backend cycles idle (66.67%) 54592585444704 instructions # 0.71 insn per cycle # 1.01 stalled cycles per insn (83.33%) 9641589406714 branches # 237.289 M/sec (83.33%) 435317273069 branch-misses # 4.51% of all branches (83.33%) 11928.047003788 seconds time elapsed 38187.685111000 seconds user 2502.075987000 seconds sys As said in an earlier email: A ThinLTO+PGO optimized LLVM-toolchain saves here approx. 60mins of build-time. Depending on the host-kernel including Peter's NOPS patchset: 3mins longer build-time. Brewing time of one single Turkish Tea bag. Attached are the 3 build-time log-files. - Sedat - [-- Attachment #2: build-time_5.12.0-rc2-9-amd64-clang12-cfi.txt --] [-- Type: text/plain, Size: 1344 bytes --] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-9-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-13 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc2-9~bullseye+dileks1': 55936351.95 msec task-clock # 3.580 CPUs utilized 8291848 context-switches # 0.148 K/sec 269686 cpu-migrations # 0.005 K/sec 288389721 page-faults # 0.005 M/sec 108344049253836 cycles # 1.937 GHz 83228135285263 stalled-cycles-frontend # 76.82% frontend cycles idle 65616255370809 stalled-cycles-backend # 60.56% backend cycles idle 59590373937199 instructions # 0.55 insn per cycle # 1.40 stalled cycles per insn 10906265495505 branches # 194.976 M/sec 488578274434 branch-misses # 4.48% of all branches 15622.926203302 seconds time elapsed 53453.974928000 seconds user 2526.773533000 seconds sys [-- Attachment #3: build-time_5.12.0-rc2-10-amd64-clang12-cfi.txt --] [-- Type: text/plain, Size: 1346 bytes --] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-10-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-14 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc2-10~bullseye+dileks1': 40223080.69 msec task-clock # 3.434 CPUs utilized 7438923 context-switches # 0.185 K/sec 245636 cpu-migrations # 0.006 K/sec 288073015 page-faults # 0.007 M/sec 77325441657129 cycles # 1.922 GHz 55357463522675 stalled-cycles-frontend # 71.59% frontend cycles idle 38978871249074 stalled-cycles-backend # 50.41% backend cycles idle 55178265045056 instructions # 0.71 insn per cycle # 1.00 stalled cycles per insn 9749166033571 branches # 242.377 M/sec 431303563167 branch-misses # 4.42% of all branches 11714.751645982 seconds time elapsed 37951.117840000 seconds user 2313.807151000 seconds sys [-- Attachment #4: build-time_5.12.0-rc3-1-amd64-clang12-cfi.txt --] [-- Type: text/plain, Size: 1404 bytes --] Performance counter stats for 'make V=1 -j4 LLVM=1 LLVM_IAS=1 PAHOLE=/opt/pahole/bin/pahole LOCALVERSION=-1-amd64-clang12-cfi KBUILD_VERBOSE=1 KBUILD_BUILD_HOST=iniza KBUILD_BUILD_USER=sedat.dilek@gmail.com KBUILD_BUILD_TIMESTAMP=2021-03-15 bindeb-pkg KDEB_PKGVERSION=5.12.0~rc3-1~bullseye+dileks1': 40632207.25 msec task-clock # 3.406 CPUs utilized 8216832 context-switches # 0.202 K/sec 277610 cpu-migrations # 0.007 K/sec 281331052 page-faults # 0.007 M/sec 77031538570411 cycles # 1.896 GHz (83.33%) 55247905369487 stalled-cycles-frontend # 71.72% frontend cycles idle (83.33%) 39046795510242 stalled-cycles-backend # 50.69% backend cycles idle (66.67%) 54592585444704 instructions # 0.71 insn per cycle # 1.01 stalled cycles per insn (83.33%) 9641589406714 branches # 237.289 M/sec (83.33%) 435317273069 branch-misses # 4.51% of all branches (83.33%) 11928.047003788 seconds time elapsed 38187.685111000 seconds user 2502.075987000 seconds sys ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 17:04 ` Sedat Dilek @ 2021-03-15 17:15 ` Borislav Petkov 2021-03-15 17:19 ` Sedat Dilek 2021-03-15 18:10 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2021-03-15 17:15 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 06:04:41PM +0100, Sedat Dilek wrote: > Here some fresh numbers: Lemme paste my previous reply which still holds true here: "There's a reason I have -s for silent in the build - printing output during the build creates a *lot* of variance. And you have excessive printing with V=1 and KBUILD_VERBOSE=1. Also, you need to repeat those workloads a couple of times - one is not enough. That's why I have --repeat 5 in there. Also, you need --pre=/root/bin/pre-build-kernel.sh where that script is: --- #!/bin/bash echo $0 make -s clean echo 3 > /proc/sys/vm/drop_caches --- so that you can avoid pagecache influence." -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 17:15 ` Borislav Petkov @ 2021-03-15 17:19 ` Sedat Dilek 2021-03-15 17:23 ` Borislav Petkov 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-15 17:19 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 6:15 PM Borislav Petkov <bp@alien8.de> wrote: > > On Mon, Mar 15, 2021 at 06:04:41PM +0100, Sedat Dilek wrote: > > Here some fresh numbers: > > Lemme paste my previous reply which still holds true here: > > "There's a reason I have -s for silent in the build - printing output > during the build creates a *lot* of variance. And you have excessive > printing with V=1 and KBUILD_VERBOSE=1. > I have this for diagnostic reasons. Yes, I can drop V=1 and KBUILD_VERBOSE=1. This is a good idea for a fast build. > Also, you need to repeat those workloads a couple of times - one is not > enough. That's why I have --repeat 5 in there. > > Also, you need --pre=/root/bin/pre-build-kernel.sh where that script is: > > --- > #!/bin/bash > echo $0 > > make -s clean > echo 3 > /proc/sys/vm/drop_caches > --- > > so that you can avoid pagecache influence." > With my next build I try to apply this. - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 17:19 ` Sedat Dilek @ 2021-03-15 17:23 ` Borislav Petkov 0 siblings, 0 replies; 51+ messages in thread From: Borislav Petkov @ 2021-03-15 17:23 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 06:19:34PM +0100, Sedat Dilek wrote: > With my next build I try to apply this. Your perf tool command should look something like this: perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j9 LLVM=1 LLVM_IAS=1 bzImage Also, needless to say, your box needs to not run anything else during the measurement. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 17:04 ` Sedat Dilek 2021-03-15 17:15 ` Borislav Petkov @ 2021-03-15 18:10 ` Peter Zijlstra 2021-03-15 18:23 ` Sedat Dilek 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2021-03-15 18:10 UTC (permalink / raw) To: Sedat Dilek Cc: Borislav Petkov, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 06:04:41PM +0100, Sedat Dilek wrote: > make V=1 -j4 LLVM=1 LLVM_IAS=1 So for giggles I checked, neither GCC nor LLVM seem to emit prefix NOPs when building with -march=sandybridge, they always use MOPL. Furthermore, the kernel explicitly sets: -falign-jumps=1 -falign-loops=1, which, when not specified, default to 16 or so. This means that your userspace is *littered* with NOPL, even when you build your entire distro from source with -march=sandybridge. (arch/gentoo FTW I suppose). (The only good new is that recent LLVM has a pass to use alternative instruction encoding in order to grow a basic block in size in order to minimize the amount of NOP it needs to emit at the end in order to satisfy the jump/loop alignment.) So if you *really* deeply care about NOP performance on your SNB, start by teaching LLVM about prefix NOPs and rebuild your complete userspace. At that point, you can do some trivial patches to the kernel to make it use -march=sandybridge and prefix NOPs too. Until that time, the vast majority of NOPs your CPU will execute will be NOPL. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 18:10 ` Peter Zijlstra @ 2021-03-15 18:23 ` Sedat Dilek 2021-03-15 22:14 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-15 18:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 7:10 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Mar 15, 2021 at 06:04:41PM +0100, Sedat Dilek wrote: > > > make V=1 -j4 LLVM=1 LLVM_IAS=1 > > So for giggles I checked, neither GCC nor LLVM seem to emit prefix NOPs > when building with -march=sandybridge, they always use MOPL. > > Furthermore, the kernel explicitly sets: -falign-jumps=1 > -falign-loops=1, which, when not specified, default to 16 or so. > > This means that your userspace is *littered* with NOPL, even when you > build your entire distro from source with -march=sandybridge. > (arch/gentoo FTW I suppose). > That reminds me of the Git repo of the wireguard maintainer. "x86: enable additional cpu optimizations for gcc v9.1+" You mean something like that ^^? - Sedat - [1] https://git.zx2c4.com/laptop-kernel/commit/?id=116badbe0a18bc36ba90acb8b80cff41f9ab0686 > (The only good new is that recent LLVM has a pass to use alternative > instruction encoding in order to grow a basic block in size in order to > minimize the amount of NOP it needs to emit at the end in order to > satisfy the jump/loop alignment.) > > So if you *really* deeply care about NOP performance on your SNB, start > by teaching LLVM about prefix NOPs and rebuild your complete userspace. > At that point, you can do some trivial patches to the kernel to make it > use -march=sandybridge and prefix NOPs too. > > Until that time, the vast majority of NOPs your CPU will execute will be > NOPL. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 18:23 ` Sedat Dilek @ 2021-03-15 22:14 ` Peter Zijlstra 2021-03-16 5:56 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2021-03-15 22:14 UTC (permalink / raw) To: Sedat Dilek Cc: Borislav Petkov, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 07:23:29PM +0100, Sedat Dilek wrote: > You mean something like that ^^? > > - Sedat - > > [1] https://git.zx2c4.com/laptop-kernel/commit/?id=116badbe0a18bc36ba90acb8b80cff41f9ab0686 *shudder*, I was more thinking you'd simply add it to you CFLAGS when building. I don't see any point in having that in Kconfig. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-15 22:14 ` Peter Zijlstra @ 2021-03-16 5:56 ` Sedat Dilek 2021-03-27 12:08 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-16 5:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat On Mon, Mar 15, 2021 at 11:14 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Mar 15, 2021 at 07:23:29PM +0100, Sedat Dilek wrote: > > > You mean something like that ^^? > > > > - Sedat - > > > > [1] https://git.zx2c4.com/laptop-kernel/commit/?id=116badbe0a18bc36ba90acb8b80cff41f9ab0686 > > *shudder*, I was more thinking you'd simply add it to you CFLAGS when > building. I don't see any point in having that in Kconfig. Simply adding the CFLAGS to arch/x86/Makefile. If I forgot to mention: Tested-by: Sedat Dilek <sedat.dilek@gmail.com>. # LLVM/Clang v12.0.0-rc3 - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-16 5:56 ` Sedat Dilek @ 2021-03-27 12:08 ` Sedat Dilek 2021-03-27 20:02 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Sedat Dilek @ 2021-03-27 12:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, x86, rostedt, hpa, torvalds, linux-kernel, linux-toolchains, jpoimboe, alexei.starovoitov, mhiramat Out of curiosity I tried in my build-environment and my testing-rules to have comparable numbers... ..without passing "V=1" and "KBUILD_VERBOSE=1" as make-options: NOTE: Identical linux-config plus LLVM/Clang v12.0.0-rc3. debian-5.10.19 as host-kernel: 11655.755564957 seconds time elapsed dileks-5.12-rc3 plus x86-nops as host-kernel: 11941.439350080 seconds time elapsed I compared the build-times only: Approx. 04:45 [mm:ss] in the worst case. ( Brewing time of a strong Turkish tea-bag ~5mins. ) I will keep both make-options to see what's going on in my builds. - Sedat - ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-27 12:08 ` Sedat Dilek @ 2021-03-27 20:02 ` Linus Torvalds 2021-03-30 12:31 ` Sedat Dilek 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2021-03-27 20:02 UTC (permalink / raw) To: Sedat Dilek Cc: Peter Zijlstra, Borislav Petkov, the arch/x86 maintainers, Steven Rostedt, Peter Anvin, Linux Kernel Mailing List, linux-toolchains, Josh Poimboeuf, Alexei Starovoitov, Masami Hiramatsu On Sat, Mar 27, 2021 at 5:08 AM Sedat Dilek <sedat.dilek@gmail.com> wrote: > > debian-5.10.19 as host-kernel: > 11655.755564957 seconds time elapsed > > dileks-5.12-rc3 plus x86-nops as host-kernel: > 11941.439350080 seconds time elapsed That's 2.5% - a huge difference. Particularly since kernel build times shouldn't even be that kernel-intensive. I think there's something else going on than the nops. Same config? There are likely many other differences between 5.10.19 and 5.12-rc3. So can you check just plain 5.12-rc3 and then 5.12-rc3 plus x86-nops, with otherwise identical configuration? Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/2] x86: Remove ideal_nops[] 2021-03-27 20:02 ` Linus Torvalds @ 2021-03-30 12:31 ` Sedat Dilek 0 siblings, 0 replies; 51+ messages in thread From: Sedat Dilek @ 2021-03-30 12:31 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Borislav Petkov, the arch/x86 maintainers, Steven Rostedt, Peter Anvin, Linux Kernel Mailing List, linux-toolchains, Josh Poimboeuf, Alexei Starovoitov, Masami Hiramatsu On Sat, Mar 27, 2021 at 9:02 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > On Sat, Mar 27, 2021 at 5:08 AM Sedat Dilek <sedat.dilek@gmail.com> wrote: > > > > debian-5.10.19 as host-kernel: > > 11655.755564957 seconds time elapsed > > > > dileks-5.12-rc3 plus x86-nops as host-kernel: > > 11941.439350080 seconds time elapsed > > That's 2.5% - a huge difference. Particularly since kernel build times > shouldn't even be that kernel-intensive. > > I think there's something else going on than the nops. Same config? > There are likely many other differences between 5.10.19 and 5.12-rc3. > > So can you check just plain 5.12-rc3 and then 5.12-rc3 plus x86-nops, > with otherwise identical configuration? > Hi Linus, I re-checked my linux-config and custom patchset. I had "kbuild: add CONFIG_VMLINUX_MAP expert option" in my queue and build with CONFIG_VMLINUX_MAP=y. This option generated here an approx. 30MiB big vmlinux.map file. Cannot say how long this is taking in seconds but that can explain the the time-diff. [ The above option is helpful to analyze a recent Linux-kernel build with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. Always, I was able to build but not boot on bare metal with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. With a LLVM toolchain, of course. ] ( In the meantime Debian has a 5.20.26 kernel released - so if you want I can re-test with Linux v5.12-rc5. ) Regards, - Sedat - [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=babd8cd96d333cb83c9b8abf4f01ab1f161d6ec4 ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-01-22 2:32 UTC | newest] Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-03-12 11:32 [PATCH 0/2] x86: Remove ideal_nops[] Peter Zijlstra 2021-03-12 11:32 ` [PATCH 1/2] x86: Remove dynamic NOP selection Peter Zijlstra 2021-03-12 12:09 ` Peter Zijlstra 2021-03-12 20:36 ` Linus Torvalds 2024-01-20 6:58 ` Thorsten Glaser 2024-01-20 8:22 ` H. Peter Anvin 2024-01-20 16:53 ` Thorsten Glaser 2024-01-21 23:21 ` H. Peter Anvin 2024-01-21 23:58 ` Thorsten Glaser 2024-01-22 0:15 ` H. Peter Anvin 2024-01-22 0:56 ` Steven Rostedt 2024-01-22 1:17 ` Thorsten Glaser 2024-01-22 2:04 ` H. Peter Anvin 2024-01-22 2:15 ` H. Peter Anvin 2024-01-22 2:22 ` Steven Rostedt 2024-01-22 2:31 ` H. Peter Anvin 2024-01-20 17:00 ` Linus Torvalds 2024-01-20 17:19 ` Thorsten Glaser 2024-01-20 18:21 ` disassemblers (was Re: [PATCH 1/2] x86: Remove dynamic NOP selection) Thorsten Glaser 2024-01-21 22:36 ` [PATCH 1/2] x86: Remove dynamic NOP selection David Laight 2024-01-21 23:10 ` H. Peter Anvin 2021-03-12 11:32 ` [PATCH 2/2] objtool,x86: Use asm/nops.h Peter Zijlstra 2021-03-12 14:29 ` [PATCH 0/2] x86: Remove ideal_nops[] Sedat Dilek 2021-03-12 14:47 ` Borislav Petkov 2021-03-12 17:26 ` Steven Rostedt 2021-03-12 17:35 ` Sedat Dilek 2021-03-12 17:46 ` Borislav Petkov 2021-03-12 17:47 ` Steven Rostedt 2021-03-12 18:13 ` Sedat Dilek 2021-03-12 19:03 ` Sedat Dilek 2021-03-12 20:59 ` Borislav Petkov [not found] ` <CA+icZUWSCS6vAQOXoG6nsW+Dbnogivzf+rmegCTMjz5hjE5cKQ@mail.gmail.com> 2021-03-13 8:49 ` Borislav Petkov 2021-03-13 11:23 ` Borislav Petkov 2021-03-13 12:10 ` Sedat Dilek 2021-03-13 12:15 ` Borislav Petkov 2021-03-13 12:38 ` Sedat Dilek 2021-03-13 12:49 ` Borislav Petkov 2021-03-13 12:58 ` Sedat Dilek 2021-03-13 13:29 ` Borislav Petkov 2021-03-13 13:47 ` Sedat Dilek 2021-03-15 17:04 ` Sedat Dilek 2021-03-15 17:15 ` Borislav Petkov 2021-03-15 17:19 ` Sedat Dilek 2021-03-15 17:23 ` Borislav Petkov 2021-03-15 18:10 ` Peter Zijlstra 2021-03-15 18:23 ` Sedat Dilek 2021-03-15 22:14 ` Peter Zijlstra 2021-03-16 5:56 ` Sedat Dilek 2021-03-27 12:08 ` Sedat Dilek 2021-03-27 20:02 ` Linus Torvalds 2021-03-30 12:31 ` Sedat Dilek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).