LKML Archive on lore.kernel.org
* [RFC v2 0/6] x86: dynamic indirect branch promotion
@ 2018-12-31  7:21 Nadav Amit
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
                   ` (7 more replies)
  0 siblings, 8 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

This is a revised version of optpolines (formerly named relpolines) for
dynamic indirect branch promotion, intended to reduce retpoline
overheads [1].

This version addresses some of the concerns that were raised before.
Accordingly, the code was slightly simplified, and patching is now done
using the regular int3/breakpoint mechanism.

Outline optpolines for multiple targets were added. I am not sure the
way I implemented them is the correct one: in my original (private)
version, if there are more targets than the outline block can hold, the
outline block is completely removed. However, I think this is
more or less how Josh wanted it to be.

The code modifications are now done using a GCC plugin, which makes it
easy to ignore code from init and other special sections. It should
also allow us to add opt-in/opt-out support for each branch, for
example by marking function pointers using address-space attributes.

All of these changes required dropping some optimizations to keep the
code simple. I have not yet rerun the benchmarks.

I might not have addressed all the open issues, since it is hard to
finish the implementation while some high-level decisions, which affect
how the optimizations should be done, are still open.

Specifically:

- Is it going to be the only indirect branch promotion mechanism? If so,
  it probably should also provide an interface similar to Josh's
  "static-calls" with annotations.

- Should it also be used when retpolines are disabled (in the config)?
  This does complicate the implementation a bit (RFC v1 supported it).

- Is it going to be opt-in or opt-out? If it is an opt-out mechanism,
  memory and performance optimizations need to be more aggressive.

- Do we use periodic learning or not? Josh suggested reconfiguring the
  branches whenever a new target is found. However, I do not know at
  this time how to do that efficiently without making learning much
  more expensive.
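For concreteness, here is a hedged userspace sketch of the learning fast
path that the series implements in assembly (save_optpoline_<reg> in
patch 3): each indirect call records its source/target pair in a small
hash table indexed by src^tgt, and bumps a counter. The function name
record_sample is illustrative; only struct optpoline_sample and the
table size come from the series, and the real table is per-CPU.

```c
#include <stdint.h>

#define OPTPOLINE_SAMPLES_NUM	(1 << 8)
#define OPTPOLINE_SAMPLES_MASK	(OPTPOLINE_SAMPLES_NUM - 1)

/* Mirrors struct optpoline_sample from the series. */
struct optpoline_sample {
	uint32_t src;
	uint32_t tgt;
	uint32_t cnt;
};

static struct optpoline_sample samples[OPTPOLINE_SAMPLES_NUM];

/*
 * Sketch of the per-call learning step: hash source and target into a
 * fixed-size table slot, overwrite the pair, and bump the counter.
 * Collisions simply clobber the previous pair, keeping the fast path
 * cheap. Returns the chosen slot index.
 */
static unsigned int record_sample(uint32_t src, uint32_t tgt)
{
	unsigned int idx = (src ^ tgt) & OPTPOLINE_SAMPLES_MASK;

	samples[idx].src = src;
	samples[idx].tgt = tgt;
	samples[idx].cnt++;
	return idx;
}
```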

[1] https://lore.kernel.org/patchwork/cover/1001332/

Nadav Amit (6):
  x86: introduce kernel restartable sequence
  objtool: ignore instructions
  x86: patch indirect branch promotion
  x86: interface for accessing indirect branch locations
  x86: learning and patching indirect branch targets
  x86: outline optpoline

 arch/x86/Kconfig                             |    4 +
 arch/x86/entry/entry_64.S                    |   16 +-
 arch/x86/include/asm/nospec-branch.h         |   83 ++
 arch/x86/include/asm/sections.h              |    2 +
 arch/x86/kernel/Makefile                     |    1 +
 arch/x86/kernel/asm-offsets.c                |    9 +
 arch/x86/kernel/nospec-branch.c              | 1293 ++++++++++++++++++
 arch/x86/kernel/traps.c                      |    7 +
 arch/x86/kernel/vmlinux.lds.S                |    7 +
 arch/x86/lib/retpoline.S                     |   83 ++
 include/linux/cpuhotplug.h                   |    1 +
 include/linux/module.h                       |    9 +
 kernel/module.c                              |    8 +
 scripts/Makefile.gcc-plugins                 |    3 +
 scripts/gcc-plugins/x86_call_markup_plugin.c |  329 +++++
 tools/objtool/check.c                        |   21 +-
 16 files changed, 1872 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kernel/nospec-branch.c
 create mode 100644 scripts/gcc-plugins/x86_call_markup_plugin.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 1/6] x86: introduce kernel restartable sequence
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31 20:08   ` Andy Lutomirski
                     ` (2 more replies)
  2018-12-31  7:21 ` [RFC v2 2/6] objtool: ignore instructions Nadav Amit
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

It is sometimes beneficial to have a restartable sequence: a short run
of instructions which, if preempted, jumps to a predefined point.

To provide such functionality on x86-64, we use an empty REX prefix
(opcode 0x40) as an indication that an instruction is part of such a
sequence. Before calling the schedule-IRQ routine, if the "magic"
prefix is found, we call a routine to adjust the instruction pointer.
This opcode is not expected to be in common use.
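A minimal userspace model of that check follows; it is illustrative
only (in_restartable_seq and fake_regs are not from the patch, which
open-codes the test in entry_64.S and restart_kernel_rseq()):

```c
#include <stdint.h>
#include <stdbool.h>

/* The "magic" empty REX prefix used by the patch. */
#define KERNEL_RESTARTABLE_PREFIX 0x40

/* Stand-in for the only pt_regs field the check needs. */
struct fake_regs {
	const uint8_t *ip;
};

/*
 * Mimics the test performed before preempt_schedule_irq: the
 * interrupted instruction belongs to a restartable sequence only if
 * its first byte is the magic prefix.
 */
static bool in_restartable_seq(const struct fake_regs *regs)
{
	return regs->ip[0] == KERNEL_RESTARTABLE_PREFIX;
}
```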

The next patch will make use of this mechanism. Since there are no
other users (yet?), this patch does not bother to create a general
infrastructure and API that others could use for such sequences. Still,
it should not be hard to add such an extension later.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/entry_64.S            | 16 ++++++++++++++--
 arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
 arch/x86/kernel/traps.c              |  7 +++++++
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..e144ff8b914f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -644,12 +644,24 @@ retint_kernel:
 	/* Interrupts are off */
 	/* Check if we need preemption */
 	btl	$9, EFLAGS(%rsp)		/* were interrupts off? */
-	jnc	1f
+	jnc	2f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	2f
+
+	/*
+	 * Support restartable code sections in the kernel: treat an
+	 * instruction whose first byte is a REX prefix with no bits set as
+	 * an indication that the instruction is in such a section.
+	 */
+	movq    RIP(%rsp), %rax
+	cmpb    $KERNEL_RESTARTABLE_PREFIX, (%rax)
 	jnz	1f
+	mov	%rsp, %rdi
+	call	restart_kernel_rseq
+1:
 	call	preempt_schedule_irq
 	jmp	0b
-1:
+2:
 #endif
 	/*
 	 * The iretq could re-enable interrupts:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index dad12b767ba0..be4713ef0940 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -54,6 +54,12 @@
 	jnz	771b;				\
 	add	$(BITS_PER_LONG/8) * nr, sp;
 
+/*
+ * An empty REX prefix indicates that this instruction is part of a kernel
+ * restartable sequence.
+ */
+#define KERNEL_RESTARTABLE_PREFIX		(0x40)
+
 #ifdef __ASSEMBLY__
 
 /*
@@ -150,6 +156,12 @@
 #endif
 .endm
 
+.macro restartable_seq_prefix
+#ifdef CONFIG_PREEMPT
+	.byte	KERNEL_RESTARTABLE_PREFIX
+#endif
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_NOSPEC_ALTERNATIVE				\
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 85cccadb9a65..b1e855bad5ac 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -59,6 +59,7 @@
 #include <asm/fpu/xstate.h>
 #include <asm/vm86.h>
 #include <asm/umip.h>
+#include <asm/nospec-branch.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -186,6 +187,12 @@ int fixup_bug(struct pt_regs *regs, int trapnr)
 	return 0;
 }
 
+asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs)
+{
+	if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX)
+		return;
+}
+
 static nokprobe_inline int
 do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
 		  struct pt_regs *regs,	long error_code)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 2/6] objtool: ignore instructions
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31  7:21 ` [RFC v2 3/6] x86: patch indirect branch promotion Nadav Amit
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

In certain cases there is a need to suppress objtool warnings for
specific instructions. Provide an interface to achieve this.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/objtool/check.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..c890d714fb73 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -431,10 +431,11 @@ static int add_dead_ends(struct objtool_file *file)
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
-static void add_ignores(struct objtool_file *file)
+static int add_ignores(struct objtool_file *file)
 {
 	struct instruction *insn;
 	struct section *sec;
+	struct rela *rela;
 	struct symbol *func;
 
 	for_each_sec(file, sec) {
@@ -449,6 +450,20 @@ static void add_ignores(struct objtool_file *file)
 				insn->ignore = true;
 		}
 	}
+
+	sec = find_section_by_name(file->elf, ".rela.discard.ignore");
+	if (!sec)
+		return 0;
+
+	list_for_each_entry(rela, &sec->rela_list, list) {
+		insn = find_insn(file, rela->sym->sec, rela->addend);
+		if (!insn) {
+			WARN("bad .discard.ignore entry");
+			return -1;
+		}
+		insn->ignore = true;
+	}
+	return 0;
 }
 
 /*
@@ -1237,7 +1252,9 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
-	add_ignores(file);
+	ret = add_ignores(file);
+	if (ret)
+		return ret;
 
 	ret = add_nospec_ignores(file);
 	if (ret)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 3/6] x86: patch indirect branch promotion
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
  2018-12-31  7:21 ` [RFC v2 2/6] objtool: ignore instructions Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31  7:21 ` [RFC v2 4/6] x86: interface for accessing indirect branch locations Nadav Amit
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

To perform indirect branch promotion, we need to find all the indirect
call locations and patch them, while ignoring various code sections
(e.g., init, alternatives). Using a GCC plugin allows us to do so. It
is also possible to add an opt-in/opt-out mechanism on top of this
plugin.
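The code the plugin emits at each call site follows a fixed layout: a
cmp of the target register against a learned value, a jnz to a
retpoline fallback, a direct call that gets patched in, and a short jmp
over the fallback. A userspace mirror of struct optpoline_code from
this patch (with GCC attribute syntax swapped for the kernel's
__packed) makes the layout and its offsets checkable; this is a sketch
for illustration, not the kernel header itself:

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace mirror of struct optpoline_code from this patch. */
struct optpoline_code {
	union {
		struct {	/* cmp <reg>, imm32 */
			uint8_t rex, opcode, modrm;
			uint32_t imm;
		} __attribute__((packed)) cmp;
		struct {	/* short jmp over the promoted call */
			uint8_t opcode;
			int8_t rel;
		} __attribute__((packed)) skip;
		struct {	/* call used while patching */
			uint8_t opcode;
			int32_t rel;
		} __attribute__((packed)) patching_call;
	} __attribute__((packed));
	struct {	/* jnz to the retpoline fallback */
		uint8_t rex, opcode;
		int8_t rel;
	} __attribute__((packed)) jnz;
	struct {	/* promoted direct call */
		uint8_t rex, opcode;
		int32_t rel;
	} __attribute__((packed)) call;
	struct {	/* jmp past the fallback */
		uint8_t opcode, rel;
	} __attribute__((packed)) jmp_done;
	struct {	/* fallback call to the retpoline thunk */
		uint8_t rex, opcode;
		int32_t rel;
	} __attribute__((packed)) fallback;
} __attribute__((packed));
```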

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/Kconfig                             |   4 +
 arch/x86/include/asm/nospec-branch.h         |  71 ++++
 arch/x86/kernel/Makefile                     |   1 +
 arch/x86/kernel/asm-offsets.c                |   9 +
 arch/x86/kernel/nospec-branch.c              |  11 +
 arch/x86/kernel/vmlinux.lds.S                |   7 +
 arch/x86/lib/retpoline.S                     |  83 +++++
 scripts/Makefile.gcc-plugins                 |   3 +
 scripts/gcc-plugins/x86_call_markup_plugin.c | 329 +++++++++++++++++++
 9 files changed, 518 insertions(+)
 create mode 100644 arch/x86/kernel/nospec-branch.c
 create mode 100644 scripts/gcc-plugins/x86_call_markup_plugin.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e65105c1f875..b0956fb7b40b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2904,6 +2904,10 @@ config X86_DMA_REMAP
 config HAVE_GENERIC_GUP
 	def_bool y
 
+config OPTPOLINE
+       def_bool y
+       depends on X86_64 && RETPOLINE && GCC_PLUGINS
+
 source "drivers/firmware/Kconfig"
 
 source "arch/x86/kvm/Kconfig"
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index be4713ef0940..cb0a7613dd0a 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -9,6 +9,7 @@
 #include <asm/alternative-asm.h>
 #include <asm/cpufeatures.h>
 #include <asm/msr-index.h>
+#include <asm/percpu.h>
 
 /*
  * Fill the CPU return stack buffer.
@@ -30,6 +31,9 @@
 #define RSB_CLEAR_LOOPS		32	/* To forcibly overwrite all entries */
 #define RSB_FILL_LOOPS		16	/* To avoid underflow */
 
+#define OPTPOLINE_SAMPLES_NUM		(1 << 8)
+#define OPTPOLINE_SAMPLES_MASK		(OPTPOLINE_SAMPLES_NUM - 1)
+
 /*
  * Google experimented with loop-unrolling and this turned out to be
  * the optimal version — two calls, each with their own speculation
@@ -299,6 +303,73 @@ static inline void indirect_branch_prediction_barrier(void)
 	alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB);
 }
 
+/* Data structure that is used during the learning stage */
+struct optpoline_sample {
+	u32 src;
+	u32 tgt;
+	u32 cnt;
+} __packed;
+
+DECLARE_PER_CPU_ALIGNED(struct optpoline_sample[OPTPOLINE_SAMPLES_NUM],
+		       optpoline_samples);
+
+DECLARE_PER_CPU(u8, has_optpoline_samples);
+
+/*
+ * Information for optpolines as it is saved in the source.
+ */
+struct optpoline_entry {
+	void *rip;
+	u8 reg;
+} __packed;
+
+/*
+ * Reflects the structure of the assembly code. We exclude the compare
+ * opcode which depends on the register.
+ */
+struct optpoline_code {
+	union {
+		struct {
+			u8 rex;
+			u8 opcode;
+			u8 modrm;
+			u32 imm;
+		} __packed cmp;
+		struct {
+			u8 opcode;
+			s8 rel;
+		} __packed skip;
+		struct {
+			u8 opcode;
+			s32 rel;
+		} __packed patching_call;
+	} __packed;
+	struct {
+		u8 rex;
+		u8 opcode;
+		s8 rel;
+	} __packed jnz;
+	struct {
+		u8 rex;
+		u8 opcode;
+		s32 rel;
+	} __packed call;
+	struct {
+		/* Instruction is not patched, so no prefix needed */
+		u8 opcode;
+		u8 rel;
+	} __packed jmp_done;
+	struct {
+		u8 rex;
+		u8 opcode;
+		s32 rel;
+	} __packed fallback;
+} __packed;
+
+extern const void *indirect_thunks[16];
+extern const void *save_optpoline_funcs[16];
+extern const void *skip_optpoline_funcs[16];
+
 /* The Intel SPEC CTRL MSR base value cache */
 extern u64 x86_spec_ctrl_base;
 
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..7c342cfd3771 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -149,4 +149,5 @@ ifeq ($(CONFIG_X86_64),y)
 
 	obj-$(CONFIG_MMCONF_FAM10H)	+= mmconf-fam10h_64.o
 	obj-y				+= vsmp_64.o
+	obj-$(CONFIG_OPTPOLINE)		+= nospec-branch.o
 endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..e5b6236fdcb2 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/nospec-branch.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -105,4 +106,12 @@ static void __used common(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+	/* Optpolines */
+	OFFSET(OPTPOLINE_SAMPLE_src, optpoline_sample, src);
+	OFFSET(OPTPOLINE_SAMPLE_tgt, optpoline_sample, tgt);
+	OFFSET(OPTPOLINE_SAMPLE_cnt, optpoline_sample, cnt);
+	DEFINE(OPTPOLINE_CODE_SIZE, sizeof(struct optpoline_code));
+	DEFINE(OPTPOLINE_CODE_patching_call_end,
+	       offsetofend(struct optpoline_code, patching_call));
 }
diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
new file mode 100644
index 000000000000..5ae12681b23b
--- /dev/null
+++ b/arch/x86/kernel/nospec-branch.c
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Nadav Amit <namit@vmware.com>
+ */
+
+#include <linux/percpu.h>
+#include <asm/nospec-branch.h>
+
+DEFINE_PER_CPU_ALIGNED(struct optpoline_sample[OPTPOLINE_SAMPLES_NUM],
+		       optpoline_samples);
+DEFINE_PER_CPU(u8, has_optpoline_samples);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0d618ee634ac..6faf89098e40 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -355,6 +355,13 @@ SECTIONS
 	.data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
 		NOSAVE_DATA
 	}
+
+	. = ALIGN(8);
+	.optpolines : AT(ADDR(.optpolines) - LOAD_OFFSET) {
+		__optpolines = .;
+		*(.optpolines)
+		__optpolines_end = .;
+	}
 #endif
 
 	/* BSS */
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index c909961e678a..e53a08a9a385 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -7,6 +7,7 @@
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
+#include <asm/asm-offsets.h>
 
 .macro THUNK reg
 	.section .text.__x86.indirect_thunk
@@ -45,4 +46,86 @@ GENERATE_THUNK(r12)
 GENERATE_THUNK(r13)
 GENERATE_THUNK(r14)
 GENERATE_THUNK(r15)
+
+#ifdef CONFIG_OPTPOLINE
+
+.macro save_optpoline reg:req
+ENTRY(save_optpoline_\reg\())
+	pushq 	%rdi
+	pushq	%rsi
+	pushq	%rcx
+
+	/* First load the destination, for the case rsi is the destination */
+.if "\reg" != "rdi"
+	mov	%\reg, %rdi
+.endif
+	mov	24(%rsp), %rsi
+
+	/* Compute the xor as an index in the table */
+	mov	%rsi, %rcx
+	xor	%rdi, %rcx
+	and	$OPTPOLINE_SAMPLES_MASK, %ecx
+
+	/* Entry size is 12 bytes */
+	shl	$2, %ecx				# ecx *= 4
+	lea	optpoline_samples(%rcx,%rcx,2), %rcx	# rcx *= 3
+
+	movl	%esi, PER_CPU_VAR(OPTPOLINE_SAMPLE_src)(%rcx)
+	movl	%edi, PER_CPU_VAR(OPTPOLINE_SAMPLE_tgt)(%rcx)
+	incl	PER_CPU_VAR(OPTPOLINE_SAMPLE_cnt)(%rcx)
+	movb	$1, PER_CPU_VAR(has_optpoline_samples)
+
+	popq	%rcx
+	popq	%rsi
+	popq	%rdi
+	ANNOTATE_NOSPEC_ALTERNATIVE
+	ALTERNATIVE __stringify(ANNOTATE_RETPOLINE_SAFE; jmp *%\reg\()),\
+		"jmp __x86_indirect_thunk_\reg",			\
+		X86_FEATURE_RETPOLINE
+
+ENDPROC(save_optpoline_\reg\())
+_ASM_NOKPROBE(save_optpoline_\reg\())
+EXPORT_SYMBOL(save_optpoline_\reg\())
+.endm
+
+.macro skip_optpoline reg:req
+ENTRY(skip_optpoline_\reg\())
+	addq	$(OPTPOLINE_CODE_SIZE - OPTPOLINE_CODE_patching_call_end), (%_ASM_SP)
+	jmp	__x86_indirect_thunk_\reg
+ENDPROC(skip_optpoline_\reg\())
+_ASM_NOKPROBE(skip_optpoline_\reg\())
+EXPORT_SYMBOL(skip_optpoline_\reg\())
+.endm
+
+#define ARCH_REG_NAMES rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,r14,r15
+
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+save_optpoline reg=\reg
+skip_optpoline reg=\reg
+.endif
+.endr
+
+/*
+ * List of indirect thunks
+ */
+.macro create_func_per_reg_list name:req func_prefix:req
+.global \name
+\name:
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+.quad \func_prefix\()_\reg
+.else
+.quad 0
+.endif
+.endr
+.endm
+
+.pushsection .rodata
+create_func_per_reg_list name=indirect_thunks func_prefix=__x86_indirect_thunk
+create_func_per_reg_list name=save_optpoline_funcs func_prefix=save_optpoline
+create_func_per_reg_list name=skip_optpoline_funcs func_prefix=skip_optpoline
+.popsection
+
+#endif
 #endif
diff --git a/scripts/Makefile.gcc-plugins b/scripts/Makefile.gcc-plugins
index 46c5c6809806..796b6d59f27e 100644
--- a/scripts/Makefile.gcc-plugins
+++ b/scripts/Makefile.gcc-plugins
@@ -31,6 +31,9 @@ gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_STACKLEAK)		\
 		+= -DSTACKLEAK_PLUGIN
 gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_STACKLEAK)		\
 		+= -fplugin-arg-stackleak_plugin-track-min-size=$(CONFIG_STACKLEAK_TRACK_MIN_SIZE)
+
+gcc-plugin-$(CONFIG_OPTPOLINE) 			+= x86_call_markup_plugin.so
+
 ifdef CONFIG_GCC_PLUGIN_STACKLEAK
     DISABLE_STACKLEAK_PLUGIN += -fplugin-arg-stackleak_plugin-disable
 endif
diff --git a/scripts/gcc-plugins/x86_call_markup_plugin.c b/scripts/gcc-plugins/x86_call_markup_plugin.c
new file mode 100644
index 000000000000..fb01cf36c26f
--- /dev/null
+++ b/scripts/gcc-plugins/x86_call_markup_plugin.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Nadav Amit <namit@vmware.com>
+ */
+
+#include "gcc-common.h"
+
+__visible int plugin_is_GPL_compatible;
+
+static struct plugin_info kernexec_plugin_info = {
+	.version	= "201607271510vanilla",
+	.help		= "method=call\tmanipulation method\n"
+};
+
+static bool include_emitted;
+
+#define N_CLOBBERED_FUNC_REGS			(4)
+
+struct reg_pair {
+	machine_mode	mode;
+	unsigned int	regno;
+};
+
+static const struct reg_pair clobbered_func_regs[N_CLOBBERED_FUNC_REGS] = {
+	{DImode, R11_REG},
+	{DImode, R10_REG},
+	{CCmode, FLAGS_REG},
+	{CCFPmode, FPSR_REG}
+};
+
+struct output_pair {
+	const char *constraint;
+	unsigned int regno;
+};
+
+#define N_OUTPUT_FUNC_REGS			(7)
+
+/* VREG indicates the call register, which is N_OUTPUT_FUNC_REGS + 1 */
+#define VREG					"8"
+
+static const struct output_pair output_regs[N_OUTPUT_FUNC_REGS] = {
+	/* Order must not be changed, since input constraints refer to outputs */
+	{"=r", SP_REG},
+	{"=D", DI_REG},
+	{"=S", SI_REG},
+	{"=c", CX_REG},
+	{"=d", DX_REG},
+	{"+r", R8_REG},
+	{"+r", R9_REG}
+};
+
+#define KERNEL_RESTARTABLE_PREFIX		"0x40"
+
+/*
+ * %V8, since 8 = N_OUTPUT_FUNC_REGS + 1
+ *
+ * There are a few suboptimizations in this code that can be addressed in the
+ * future. They simplify the code, though.
+ *
+ * 1. We always encode a longer version of CMP, even when 'cmp eax, imm' is possible.
+ * 2. We always encode the "restartable" prefix, even on non-preemptive or
+ *    voluntary-preemption kernels.
+ */
+const char *call_block =
+	"# INDIRECT BRANCH -------------------				\n"
+	"	i = 0							\n"
+	"	.irp reg_it, rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,r14,r15\n"
+	"	.ifc \"%V" VREG "\", \"\\reg_it\"			\n"
+	"		reg_num=i					\n"
+	"	.endif							\n"
+	"	i = i + 1						\n"
+	"	.endr							\n"
+	"1:								\n"
+	".section .optpolines,\"a\"					\n"
+	"	.quad		1b					\n"
+	"	.byte		reg_num					\n"
+	".previous							\n"
+	"								\n"
+	"	.byte 0x48 | ((reg_num & 8) >> 3)			\n"
+	"	.byte 0x81, 0xf8 | (reg_num & 7)			\n"
+	"	.long 0							\n"
+	"								\n"
+	"	# jmp 4f, patched to jnz in runtime			\n"
+	"	.byte " KERNEL_RESTARTABLE_PREFIX ", 0xeb, 4f - 2f	\n"
+	"								\n"
+	"	# call retpoline, tell objtool about it			\n"
+	"2:								\n"
+	"	.pushsection .discard.ignore				\n"
+	"	.long 2b - .						\n"
+	"	.popsection						\n"
+	"	.byte " KERNEL_RESTARTABLE_PREFIX ", 0xe8		\n"
+	"	.long __x86_indirect_thunk_%V" VREG " - 3f		\n"
+	"3:								\n"
+	"	# jmp 5f,  tell objtool about it			\n"
+	"	.pushsection .discard.ignore				\n"
+	"	.long 3b - .						\n"
+	"	.popsection						\n"
+	"	.byte 0xeb, 5f - 4f					\n"
+	"4:								\n"
+	"	# retpoline						\n"
+	"	.byte " KERNEL_RESTARTABLE_PREFIX ", 0xe8		\n"
+	"	.long __x86_indirect_thunk_%V" VREG " - 5f 		\n"
+	"5:								\n"
+	" # ----------------------------------				\n";
+
+static unsigned int x86_call_markup_execute(void)
+{
+	rtx_insn *insn;
+	rtx annotate;
+	const char *buf;
+	const char * name;
+
+	insn = get_first_nonnote_insn();
+	if (!insn)
+		return 0;
+
+	/* Do not patch init (and other) section calls */
+	if (current_function_decl) {
+	       const char *sec_name = DECL_SECTION_NAME(current_function_decl);
+
+	       if (sec_name)
+		       return 0;
+	}
+
+	buf = call_block;
+
+	for (insn = get_insns(); insn; insn = NEXT_INSN(insn)) {
+		unsigned int i, j, n_inputs;
+		bool has_output;
+		rtvec arg_vec, constraint_vec, label_vec;
+		rtx operands, call, call_op, annotate;
+		rtx asm_op, new_body, p, clob;
+		rtx output_reg;
+		rtx body;
+
+		if (!CALL_P(insn))
+			continue;
+
+		body = PATTERN(insn);
+		switch (GET_CODE(body)) {
+		case CALL:
+			/* A call with no return value */
+			has_output = false;
+			call = body;
+			break;
+		case SET:
+			/* A call with a return value */
+			has_output = true;
+			call = SET_SRC(body);
+			break;
+		default:
+			return -1;
+		}
+
+		if (GET_CODE(call) != CALL)
+			continue;
+
+		call_op = XEXP(XEXP(call, 0), 0);
+
+		switch (GET_CODE(call_op)) {
+		case SYMBOL_REF:
+			/* direct call */
+			continue;
+		case REG:
+			break;
+		default:
+			return -1;	/* ERROR */
+		}
+
+		/* Count the inputs */
+		for (n_inputs = 0, p = CALL_INSN_FUNCTION_USAGE (insn); p; p = XEXP (p, 1)) {
+			if (GET_CODE (XEXP (p, 0)) != USE)
+				return -1;
+			n_inputs++;
+		}
+
+		label_vec = rtvec_alloc(0);
+		arg_vec = rtvec_alloc(2 + n_inputs);
+		constraint_vec = rtvec_alloc(2 + n_inputs);
+
+		i = 0;
+
+		/* AX input */
+		RTVEC_ELT(arg_vec, i) = call_op;
+		RTVEC_ELT(constraint_vec, i) =
+			gen_rtx_ASM_INPUT_loc(GET_MODE(call_op), "r",
+					      RTL_LOCATION(call_op));
+		i++;
+
+		/* SP input */
+		RTVEC_ELT(arg_vec, i) = gen_rtx_REG(DImode, SP_REG);
+		RTVEC_ELT(constraint_vec, i) =
+			gen_rtx_ASM_INPUT_loc(DImode, "1",
+					      RTL_LOCATION(call_op));
+		i++;
+
+		for (p = CALL_INSN_FUNCTION_USAGE(insn); p; p = XEXP (p, 1)) {
+			const char *constraint;
+			rtx input;
+
+			if (GET_CODE (XEXP (p, 0)) != USE)
+				continue;
+
+			input = XEXP(XEXP(p, 0), 0);
+
+			if (MEM_P(input)) {
+				constraint = "m";
+			} else if (REG_P(input)) {
+				switch (REGNO(input)) {
+				case DI_REG:
+					constraint = "D";
+					break;
+				case SI_REG:
+					constraint = "S";
+					break;
+				case DX_REG:
+					constraint = "d";
+					break;
+				case CX_REG:
+					constraint = "c";
+					break;
+				case R8_REG:
+					constraint = "r";
+					break;
+				case R9_REG:
+					constraint = "r";
+					break;
+				default:
+					return -1;
+				}
+			} else {
+				return -1;
+			}
+			RTVEC_ELT(arg_vec, i) = input;
+			rtx input_rtx = gen_rtx_ASM_INPUT_loc(GET_MODE(input),
+							      ggc_strdup(constraint),
+							      RTL_LOCATION(input));
+
+			RTVEC_ELT(constraint_vec, i) = input_rtx;
+			i++;
+		}
+
+		new_body = gen_rtx_PARALLEL(VOIDmode,
+				rtvec_alloc(1 + 1 + N_OUTPUT_FUNC_REGS +
+					    N_CLOBBERED_FUNC_REGS));
+
+		/*
+		 * The function output. If none still mark as if AX is
+		 * written to ensure it is clobbered.
+		 */
+		i = 0;
+		output_reg = has_output ? SET_DEST(body) :
+					  gen_rtx_REG(DImode, AX_REG);
+		asm_op = gen_rtx_ASM_OPERANDS(VOIDmode, ggc_strdup(buf), "=a", i,
+					      arg_vec, constraint_vec,
+					      label_vec, RTL_LOCATION(insn));
+		XVECEXP(new_body, 0, i++) = gen_rtx_SET(output_reg, asm_op);
+
+		/*
+		 * SP is used as output. Since there is always an output, we do
+		 * not use MEM_VOLATILE_P
+		 */
+		for (j = 0; j < N_OUTPUT_FUNC_REGS; j++) {
+			const struct output_pair *output = &output_regs[j];
+			rtx reg_rtx;
+
+			asm_op = gen_rtx_ASM_OPERANDS(VOIDmode, ggc_strdup(buf),
+						      output->constraint, i,
+						      arg_vec, constraint_vec,
+						      label_vec, RTL_LOCATION(insn));
+
+			reg_rtx = gen_rtx_REG(DImode, output->regno);
+			XVECEXP(new_body, 0, i++) = gen_rtx_SET(reg_rtx, asm_op);
+		}
+
+		/* Add the clobbers */
+		for (j = 0; j < N_CLOBBERED_FUNC_REGS; j++) {
+			const struct reg_pair *regs = &clobbered_func_regs[j];
+
+			clob = gen_rtx_REG(regs->mode, regs->regno);
+			clob = gen_rtx_CLOBBER(VOIDmode, clob);
+			XVECEXP(new_body, 0, i++) = clob;
+		}
+
+		/* Memory clobber */
+		clob = gen_rtx_SCRATCH(VOIDmode);
+		clob = gen_rtx_MEM(BLKmode, clob);
+		clob = gen_rtx_CLOBBER(VOIDmode, clob);
+		XVECEXP(new_body, 0, i++) = clob;
+
+		if (n_inputs >= 5)
+			emit_insn_before(gen_rtx_USE(VOIDmode,
+					  gen_rtx_REG(DImode, R8_REG)), insn);
+		if (n_inputs >= 6)
+			emit_insn_before(gen_rtx_USE(VOIDmode,
+					  gen_rtx_REG(DImode, R9_REG)), insn);
+
+		emit_insn_before(new_body, insn);
+
+		delete_insn(insn);
+	}
+	return 0;
+}
+
+#define PASS_NAME x86_call_markup
+#define NO_GATE
+
+#include "gcc-generate-rtl-pass.h"
+
+__visible int plugin_init(struct plugin_name_args *plugin_info,
+			  struct plugin_gcc_version *version)
+{
+	const char * const plugin_name = plugin_info->base_name;
+	const int argc = plugin_info->argc;
+	const struct plugin_argument *argv = plugin_info->argv;
+
+	if (!plugin_default_version_check(version, &gcc_version)) {
+		error(G_("incompatible gcc/plugin versions"));
+		return 1;
+	}
+
+	register_callback(plugin_name, PLUGIN_INFO, NULL, &kernexec_plugin_info);
+
+	PASS_INFO(x86_call_markup, "expand", 1, PASS_POS_INSERT_AFTER);
+	register_callback(plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,
+			  &x86_call_markup_pass_info);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 4/6] x86: interface for accessing indirect branch locations
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
                   ` (2 preceding siblings ...)
  2018-12-31  7:21 ` [RFC v2 3/6] x86: patch indirect branch promotion Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31  7:21 ` [RFC v2 5/6] x86: learning and patching indirect branch targets Nadav Amit
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

Add a C interface for accessing the locations of indirect branches, to
be used for dynamic patching.
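As a sketch of the consumer side, the patching code can walk the entry
table delimited by __optpolines/__optpolines_end (or a module's
optpoline_entries array). The struct mirrors optpoline_entry from the
series; count_entries_for_reg is an illustrative helper, not part of
the patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Mirrors struct optpoline_entry: call-site address plus the register
 * holding the branch target. */
struct optpoline_entry {
	void *rip;
	uint8_t reg;
} __attribute__((packed));

/*
 * Walk a [start, end) table of entries, as the patching code would
 * walk the __optpolines section, here counting the call sites that
 * use a given register.
 */
static size_t count_entries_for_reg(const struct optpoline_entry *start,
				    const struct optpoline_entry *end,
				    uint8_t reg)
{
	const struct optpoline_entry *e;
	size_t n = 0;

	for (e = start; e < end; e++)
		if (e->reg == reg)
			n++;
	return n;
}
```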

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/sections.h | 2 ++
 include/linux/module.h          | 9 +++++++++
 kernel/module.c                 | 8 ++++++++
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 8ea1cfdbeabc..acba39349f46 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -4,6 +4,7 @@
 
 #include <asm-generic/sections.h>
 #include <asm/extable.h>
+#include <asm/nospec-branch.h>
 
 extern char __brk_base[], __brk_limit[];
 extern struct exception_table_entry __stop___ex_table[];
@@ -11,6 +12,7 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
+extern const struct optpoline_entry __optpolines[], __optpolines_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..2590719dd8f2 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -40,6 +40,8 @@ struct modversion_info {
 
 struct module;
 struct exception_table_entry;
+struct optpoline_entry;
+struct optpoline;
 
 struct module_kobject {
 	struct kobject kobj;
@@ -387,6 +389,13 @@ struct module {
 	unsigned int num_exentries;
 	struct exception_table_entry *extable;
 
+#ifdef CONFIG_OPTPOLINE
+	/* Dynamic indirect branch promotion */
+	struct optpoline_entry *optpoline_entries;
+	struct optpoline *optpolines;
+	unsigned int num_optpolines;
+#endif
+
 	/* Startup function. */
 	int (*init)(void);
 
diff --git a/kernel/module.c b/kernel/module.c
index 99b46c32d579..5e9edca9bbdc 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -64,6 +64,7 @@
 #include <linux/bsearch.h>
 #include <linux/dynamic_debug.h>
 #include <linux/audit.h>
+#include <asm/nospec-branch.h>
 #include <uapi/linux/module.h>
 #include "module-internal.h"
 
@@ -256,6 +257,7 @@ static void mod_update_bounds(struct module *mod)
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
+struct list_head *cachepoline_modules = &modules;
 
 static void module_assert_mutex(void)
 {
@@ -3125,6 +3127,12 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
 
+#ifdef CONFIG_OPTPOLINE
+	mod->optpoline_entries = section_objs(info, "__optpolines",
+					      sizeof(*mod->optpoline_entries),
+					      &mod->num_optpolines);
+#endif
+
 	if (section_addr(info, "__obsparm"))
 		pr_warn("%s: Ignoring obsolete parameters\n", mod->name);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 5/6] x86: learning and patching indirect branch targets
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
                   ` (3 preceding siblings ...)
  2018-12-31  7:21 ` [RFC v2 4/6] x86: interface for accessing indirect branch locations Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31 20:05   ` Andy Lutomirski
  2018-12-31  7:21 ` [RFC v2 6/6] x86: outline optpoline Nadav Amit
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

During runtime, we collect the targets of indirect branches and patch
them in. Patching is done asynchronously, by modifying each of the
optpoline code-paths separately while diverting code execution to the
other path during patching. Preemption is disabled while the optpoline
code runs, and we wait for a scheduling event on each core to ensure no
core is still executing the patched code.

To make use of optpolines, a worker goes over the collected indirect
call targets and sorts them according to frequency. The target that was
encountered the most times is patched in.

Periodically, the indirect branches are set back into learning mode to
see whether the targets have changed. The current policy might be too
aggressive.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/nospec-branch.c | 992 ++++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h      |   1 +
 2 files changed, 993 insertions(+)

diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
index 5ae12681b23b..1503c312f715 100644
--- a/arch/x86/kernel/nospec-branch.c
+++ b/arch/x86/kernel/nospec-branch.c
@@ -4,8 +4,1000 @@
  */
 
 #include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/sort.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/mm.h>
+#include <linux/debugfs.h>
+#include <linux/jump_label.h>
+#include <linux/rhashtable.h>
 #include <asm/nospec-branch.h>
+#include <asm/text-patching.h>
+#include <asm/asm-offsets.h>
+#include <asm/sections.h>
+#include <asm/mmu_context.h>
+
+#define REX_B			(0x41)
+#define JNZ_REL8_OPCODE		(0x75)
+#define JMP_REL8_OPCODE		(0xeb)
+#define CALL_REL32_OPCODE	(0xe8)
+#define CMP_IMM32_OPCODE	(0x81)
+#define NOP_OPCODE		(0x90)
+#define INT3_OPCODE		(0xcc)
+#define CALL_IND_INS		"\xff\xd0"
+
+#define OP_READJUST_SECONDS	(1)
+#define OP_REENABLE_IN_EPOCH	(4)
+#define OP_ENABLE_IN_EPOCH	(512)
+#define OP_SAMPLE_MSECS		(30000)
+
+enum code_state {
+	OPTPOLINE_SLOWPATH,
+	OPTPOLINE_FASTPATH,
+	OPTPOLINE_COND
+};
+
+/*
+ * Information for optpolines as it dynamically changes during execution.
+ */
+struct optpoline {
+	struct list_head node;
+	struct rhash_head rhash;
+	u32 rip;
+	u32 target;
+
+	/* The register that is used by the indirect branch */
+	u8 reg : 4;
+
+	/* The state of the optpoline, which indicates on which list it is */
+	u8 state : 3;
+
+	/* Whether we ever encountered more than one target */
+	u8 unstable : 1;
+
+	/* Whether there is a valid target */
+	u8 has_target : 1;
+
+	/*
+	 * Whether the optpoline needs to be set into training mode. This is a
+	 * transitory indication, which is cleared once it is set in learning
+	 * mode.
+	 */
+	u8 to_learn : 1;
+};
+
+struct optpoline_list {
+	/* @num: number of elements in the list */
+	unsigned int num;
+
+	struct list_head list;
+};
+
+static const struct rhashtable_params optpoline_rht_params = {
+	.automatic_shrinking = true,
+	.key_len = sizeof(u32),
+	.key_offset = offsetof(struct optpoline, rip),
+	.head_offset = offsetof(struct optpoline, rhash),
+};
 
 DEFINE_PER_CPU_ALIGNED(struct optpoline_sample[OPTPOLINE_SAMPLES_NUM],
 		       optpoline_samples);
 DEFINE_PER_CPU(u8, has_optpoline_samples);
+
+enum optpoline_state {
+	OP_STATE_NEW,			/* New, no known target */
+	OP_STATE_LEARN,			/* Learning for the first time */
+	OP_STATE_RELEARN,		/* Learning when a target is known */
+	OP_STATE_STABLE,		/* Calls with a single target */
+	OP_STATE_UNSTABLE,		/* Calls with multiple targets */
+	OP_STATE_UPDATING,		/* Undergoing an update */
+	OP_STATE_LAST = OP_STATE_UPDATING,
+	OP_STATE_W_TARGET_FIRST = OP_STATE_RELEARN,
+	OP_STATE_W_TARGET_LAST = OP_STATE_UNSTABLE
+};
+
+#define OP_STATE_NUM		(OP_STATE_LAST + 1)
+
+static const char *const optpoline_state_name[OP_STATE_NUM] = {
+	[OP_STATE_NEW] = "new",
+	[OP_STATE_LEARN] = "learning",
+	[OP_STATE_RELEARN] = "relearning",
+	[OP_STATE_STABLE] = "stable",
+	[OP_STATE_UNSTABLE] = "unstable",
+	[OP_STATE_UPDATING] = "updating",
+};
+
+struct optpolines {
+	struct mutex mutex;
+	struct optpoline_sample *samples_copy;
+
+	/*
+	 * Hashtable that holds all the optpolines, key'd by the instruction
+	 * pointer of the optpoline code block.
+	 */
+	struct rhashtable rhead;
+
+	/*
+	 * List of optpolines according to their states.
+	 */
+	struct optpoline_list list[OP_STATE_NUM];
+
+	/*
+	 * Indication of whether optpolines are enabled. They might be
+	 * disabled if an error occurs.
+	 */
+	u8 enabled : 1;
+
+	/*
+	 * Unfortunately, the resizable hash-table cannot be destroyed if it
+	 * wasn't initialized, so we need to keep track of it.
+	 */
+	u8 rhead_initialized : 1;
+
+	 /* Number of unstable optpolines that are pending relearning. */
+	unsigned int pending_relearn;
+
+	/* Kernel (excluding modules) optpolines */
+	struct optpoline *kernel_optpolines;
+
+	struct dentry *dbg_entry;
+
+	ktime_t sample_time;
+};
+
+static struct optpolines optpolines;
+
+static inline void *kernel_ptr(u32 low_addr)
+{
+	return (void *)(low_addr | 0xffffffff00000000ul);
+}
+
+static inline u8 optpoline_cmp_rex_prefix(u8 reg)
+{
+	return 0x48 | ((reg & 8) >> 3);
+}
+
+static inline u8 optpoline_cmp_modrm(u8 reg)
+{
+	return 0xf8 | reg;
+}
+
+static void clear_optpoline_samples(void)
+{
+	int cpu, i;
+
+	for_each_online_cpu(cpu) {
+		/*
+		 * Do not update the target to avoid racing with the sampling.
+		 * Try to avoid having the wrong target, which might cause a
+		 * call with a single target to be considered as having
+		 * multiple ones, leading to unnecessary learning cycles.
+		 */
+		for (i = 0; i < OPTPOLINE_SAMPLES_NUM; i++) {
+			per_cpu(optpoline_samples, cpu)[i].src = 0;
+			per_cpu(optpoline_samples, cpu)[i].cnt = 0;
+		}
+		per_cpu(has_optpoline_samples, cpu) = false;
+	}
+}
+
+static void add_optpoline_to_list(struct optpoline *op,
+					 enum optpoline_state rp_state)
+{
+	optpolines.list[rp_state].num++;
+
+	/*
+	 * Add to the tail, so when putting back into the list, it will be
+	 * considered for learning last.
+	 */
+	list_add_tail(&op->node, &optpolines.list[rp_state].list);
+	op->state = rp_state;
+}
+
+static void remove_optpoline_from_list(struct optpoline *op)
+{
+	optpolines.list[op->state].num--;
+	list_del_init(&op->node);
+}
+
+static inline void change_optpoline_state(struct optpoline *op,
+					  enum optpoline_state rp_state)
+{
+	remove_optpoline_from_list(op);
+	add_optpoline_to_list(op, rp_state);
+}
+
+static int add_optpolines(const struct optpoline_entry *entries,
+			  unsigned int n_entries, struct module *mod)
+{
+	struct optpoline *op;
+	int i, r = 0;
+
+	op = kvmalloc_array(n_entries, sizeof(*op), GFP_KERNEL|__GFP_ZERO);
+	if (!op)
+		return -ENOMEM;
+
+	if (mod)
+		mod->optpolines = op;
+	else
+		optpolines.kernel_optpolines = op;
+
+	for (i = 0; i < n_entries; i++) {
+		enum optpoline_state state = OP_STATE_NEW;
+		uintptr_t rip = (uintptr_t)entries[i].rip;
+
+		/* Knowingly, we keep only 32 bits to save space */
+		op->rip = (u32)rip;
+		op->reg = entries[i].reg;
+
+		r = rhashtable_insert_fast(&optpolines.rhead, &op->rhash,
+					   optpoline_rht_params);
+		if (r < 0)
+			break;
+
+		add_optpoline_to_list(op, state);
+		op++;
+	}
+
+	if (r < 0)
+		WARN_ONCE(1, "Error loading optpolines\n");
+
+	return r;
+}
+
+static void remove_module_optpolines(struct module *mod)
+{
+	unsigned int i;
+
+	for (i = 0; i < mod->num_optpolines; i++) {
+		struct optpoline *op = &mod->optpolines[i];
+
+		/* If init somehow failed, we may see uninitialized entries */
+		if (op->rip == 0)
+			continue;
+
+		remove_optpoline_from_list(op);
+
+		rhashtable_remove_fast(&optpolines.rhead, &op->rhash,
+				       optpoline_rht_params);
+	}
+
+	kvfree(mod->optpolines);
+	mod->optpolines = NULL;
+}
+
+/*
+ * optpoline_sample_src_cmp_func() - sort by source and target
+ */
+static int optpoline_sample_src_cmp_func(const void *l, const void *r)
+{
+	const struct optpoline_sample *s1 = l;
+	const struct optpoline_sample *s2 = r;
+
+	if (s1->src != s2->src)
+		return s1->src - s2->src;
+	return s1->tgt - s2->tgt;
+}
+
+/*
+ * optpoline_sample_cnt_cmp_func() - sort by the number of samples
+ */
+static int optpoline_sample_cnt_cmp_func(const void *l, const void *r)
+{
+	const struct optpoline_sample *s1 = l;
+	const struct optpoline_sample *s2 = r;
+
+	return s2->cnt - s1->cnt;
+}
+
+/*
+ * copy_optpoline_samples() - copy the samples to a local buffer
+ *
+ * We need to process the samples without them being written concurrently,
+ * and since they may anyway reside on remote NUMA nodes, copy them first
+ * into the local buffer. During the copy, the source ip is adjusted and the
+ * samples are sorted.
+ */
+static int copy_optpoline_samples(void)
+{
+	struct optpoline_sample *p_copy = optpolines.samples_copy;
+	int cpu, i, n_entries;
+
+	for_each_online_cpu(cpu) {
+		struct optpoline_sample *orig;
+
+		if (!per_cpu(has_optpoline_samples, cpu))
+			continue;
+
+		orig = per_cpu(optpoline_samples, cpu);
+
+		for (i = 0; i < OPTPOLINE_SAMPLES_NUM; i++, orig++) {
+			p_copy->src = orig->src;
+
+			/* Do some sanity checks while we are at it */
+			if (p_copy->src == 0)
+				continue;
+
+			if (init_kernel_text((uintptr_t)kernel_ptr(p_copy->src)))
+				continue;
+
+			p_copy->src -= offsetofend(struct optpoline_code,
+						   fallback);
+
+			/*
+			 * Although we can live with potentially wrong info, as
+			 * it only affects performance, a wrong destination
+			 * might somehow (completely theoretically) be exploited
+			 * through speculative execution. Avoid torn reads. The
+			 * reads/writes are naturally aligned, so we are safe.
+			 */
+			p_copy->tgt = READ_ONCE(orig->tgt);
+
+			p_copy->cnt = orig->cnt;
+			p_copy++;
+		}
+	}
+
+	n_entries = p_copy - optpolines.samples_copy;
+
+	/* Sort by the call source (first) and destination (second) */
+	sort(optpolines.samples_copy, n_entries, sizeof(*optpolines.samples_copy),
+	     optpoline_sample_src_cmp_func, NULL);
+
+	return n_entries;
+}
+
+static void analyze_optpoline_samples(struct optpoline *op,
+				      struct optpoline_sample *samples,
+				      unsigned int n_entries)
+{
+	int i, n_targets = 0;
+
+	/* Merge destinations with the same source and sum their samples. */
+	for (i = 0; i < n_entries; i++) {
+		if (n_targets == 0 ||
+		    samples[n_targets-1].tgt != samples[i].tgt) {
+			/* New target */
+			samples[n_targets++] = samples[i];
+			continue;
+		}
+
+		/* Known target, add samples */
+		samples[n_targets - 1].cnt += samples[i].cnt;
+	}
+
+	/* Sort targets by frequency */
+	sort(samples, n_targets, sizeof(*samples),
+	     optpoline_sample_cnt_cmp_func, NULL);
+
+	/* Mark as unstable if there is more than a single target */
+	if ((op->has_target && op->target != samples[0].tgt) || n_targets > 1)
+		op->unstable = true;
+
+	op->target = samples[0].tgt;
+	op->has_target = true;
+}
+
+/*
+ * process_optpoline_samples() - process the samples and choose new targets
+ */
+static void process_optpoline_samples(void)
+{
+	unsigned int end, start;
+	int n_copied;
+
+	/*
+	 * First copy the samples so that concurrent updates do not mess up
+	 * our data. While we can cope with races that
+	 * modify the destination, we need the source rip to be consistent.
+	 */
+	n_copied = copy_optpoline_samples();
+
+	for (start = 0; start < n_copied; start = end) {
+		struct optpoline_code *code;
+		struct optpoline *op;
+
+		code = kernel_ptr(optpolines.samples_copy[start].src);
+		op = rhashtable_lookup_fast(&optpolines.rhead, &code,
+					    optpoline_rht_params);
+
+		/* Races might cause the source to be wrong. Live with it. */
+		if (!op) {
+			end = start + 1;
+			continue;
+		}
+
+		/* Find all the relevant entries */
+		for (end = start + 1; end < n_copied; end++) {
+			if (optpolines.samples_copy[start].src !=
+			    optpolines.samples_copy[end].src)
+				break;
+		}
+
+		analyze_optpoline_samples(op, &optpolines.samples_copy[start],
+					  end - start);
+
+		change_optpoline_state(op, OP_STATE_UPDATING);
+	}
+}
+
+static inline bool is_learning_optpoline(struct optpoline *op)
+{
+	/* Was explicitly required to learn */
+	if (op->to_learn)
+		return true;
+
+	/* Don't get updates if we know there are multiple targets */
+	if (op->unstable)
+		return false;
+
+	/* If we still don't know what the target is, learning is needed */
+	return !op->has_target;
+}
+
+static inline bool is_conditional_optpoline(struct optpoline *op)
+{
+	/* During learning we disable the condition */
+	if (op->to_learn)
+		return false;
+
+	/* If we don't know where to go, the condition is useless */
+	return op->has_target;
+}
+
+#define CALL_OFFSET(op, offset, target)					\
+	(s32)((uintptr_t)target - (uintptr_t)kernel_ptr(op->rip) -	\
+	      offsetofend(struct optpoline_code, offset))
+
+static void make_common_optpoline(struct optpoline *op,
+				  struct optpoline_code *code,
+				  const void *fallback_target)
+{
+	code->jmp_done.opcode = JMP_REL8_OPCODE;
+	code->jmp_done.rel = offsetofend(struct optpoline_code, fallback) -
+			     offsetofend(struct optpoline_code, jmp_done);
+
+	code->fallback.rex = KERNEL_RESTARTABLE_PREFIX;
+	code->fallback.opcode = CALL_REL32_OPCODE;
+	code->fallback.rel = CALL_OFFSET(op, fallback, fallback_target);
+}
+
+static void make_conditional_optpoline(struct optpoline *op,
+				       struct optpoline_code *code,
+				       const void *fallback_target)
+{
+	code->cmp.rex = optpoline_cmp_rex_prefix(op->reg);
+	code->cmp.opcode = CMP_IMM32_OPCODE;
+	code->cmp.modrm = optpoline_cmp_modrm(op->reg);
+	code->cmp.imm = op->target;
+
+	code->jnz.rex = KERNEL_RESTARTABLE_PREFIX;
+	code->jnz.opcode = JNZ_REL8_OPCODE;
+	code->jnz.rel = offsetof(struct optpoline_code, fallback) -
+			offsetofend(struct optpoline_code, jnz);
+
+	code->call.rex = KERNEL_RESTARTABLE_PREFIX;
+	code->call.opcode = CALL_REL32_OPCODE;
+	code->call.rel = CALL_OFFSET(op, call, kernel_ptr(op->target));
+
+	make_common_optpoline(op, code, fallback_target);
+}
+
+static void make_unconditional_optpoline(struct optpoline *op,
+					 struct optpoline_code *code,
+					 const void *fallback_target)
+{
+	/*
+	 * Avoid having some partial code that may complicate debugging, and try
+	 * to catch bugs.
+	 */
+	memset(code, INT3_OPCODE, offsetofend(struct optpoline_code, call));
+
+	code->skip.opcode = JMP_REL8_OPCODE;
+	code->skip.rel = offsetof(struct optpoline_code, fallback) -
+			 offsetofend(struct optpoline_code, skip);
+
+	code->fallback.rex = KERNEL_RESTARTABLE_PREFIX;
+	code->fallback.opcode = CALL_REL32_OPCODE;
+	code->fallback.rel = CALL_OFFSET(op, fallback, fallback_target);
+
+	make_common_optpoline(op, code, fallback_target);
+}
+
+/**
+ * update_optpolines() - patch the optpoline code
+ *
+ * Patch all the optpolines that are on the updating list. Ensure that no
+ * core still runs the optpolines we patch: since preemption is disabled
+ * between the optpoline compare and call, waiting for a scheduling event
+ * on every core means they are all safe.
+ */
+static void update_optpolines(void)
+{
+	struct list_head *list = &optpolines.list[OP_STATE_UPDATING].list;
+	const u8 patching_offset = offsetofend(struct optpoline_code,
+					       patching_call);
+	struct optpoline *op, *tmp;
+
+	mutex_lock(&text_mutex);
+
+	list_for_each_entry_safe(op, tmp, list, node) {
+		const void *fallback_target, *end_of_optpoline;
+		struct optpoline_code code, *p_code;
+		enum optpoline_state state;
+
+		p_code = kernel_ptr(op->rip);
+
+		fallback_target = (is_learning_optpoline(op)) ?
+					save_optpoline_funcs[op->reg] :
+					indirect_thunks[op->reg];
+
+		end_of_optpoline = (const void *)(p_code + 1);
+
+		/*
+		 * Read the original code, since we are too lazy to initialize
+		 * all of it.
+		 */
+		memcpy(&code, p_code, sizeof(code));
+
+		/* Skip the code, calling the retpoline during patching */
+		BUILD_BUG_ON(sizeof(code.cmp) < sizeof(code.patching_call));
+		memset(&code.patching_call, INT3_OPCODE, sizeof(code.cmp));
+		code.patching_call.opcode = CALL_REL32_OPCODE;
+		code.patching_call.rel = CALL_OFFSET(op, patching_call,
+						     skip_optpoline_funcs[op->reg]);
+
+		BUILD_BUG_ON(sizeof_field(struct optpoline_code, patching_call) != 5);
+
+		text_poke_bp(p_code, &code, sizeof(code.patching_call),
+			     &p_code->fallback);
+
+		/* Wait for everyone to see the updated version */
+		synchronize_sched();
+
+		if (is_conditional_optpoline(op))
+			make_conditional_optpoline(op, &code, fallback_target);
+		else
+			make_unconditional_optpoline(op, &code, fallback_target);
+
+		/* Patch everything but the first instruction */
+		text_poke((u8 *)p_code + patching_offset,
+			  (const u8 *)&code + patching_offset,
+			  sizeof(code) - patching_offset);
+
+		text_poke_bp(p_code, &code, sizeof(code.patching_call),
+			     &p_code->fallback);
+
+		if (is_conditional_optpoline(op)) {
+			state = op->unstable ? OP_STATE_UNSTABLE :
+					       OP_STATE_STABLE;
+		} else if (op->has_target)
+			state = OP_STATE_RELEARN;
+		else
+			state = op->to_learn ? OP_STATE_LEARN : OP_STATE_NEW;
+
+		op->to_learn = false;
+		change_optpoline_state(op, state);
+	}
+	mutex_unlock(&text_mutex);
+}
+
+static void optpoline_work(struct work_struct *work);
+
+static DECLARE_DELAYED_WORK(c_work, optpoline_work);
+
+static unsigned int set_optpolines_to_learn(enum optpoline_state state,
+					    unsigned int n)
+{
+	struct optpoline *op, *tmp;
+	unsigned int i = 0;
+
+	list_for_each_entry_safe(op, tmp, &optpolines.list[state].list, node) {
+		if (i == n)
+			break;
+
+		op->to_learn = true;
+		change_optpoline_state(op, OP_STATE_UPDATING);
+		i++;
+	}
+	return i;
+}
+
+/**
+ * relearn_pending() - Relearn optpolines which are waiting to be relearned.
+ *
+ * First relearn the targets of new indirect branches (after boot, or module
+ * load). Second, take those that have more than a single target and relearn
+ * them. Pace the learning to prevent too many collisions in our sampling
+ * data-structures.
+ */
+static int relearn_pending(void)
+{
+	unsigned int n;
+
+	n = set_optpolines_to_learn(OP_STATE_NEW, OP_ENABLE_IN_EPOCH);
+
+	if (n == 0) {
+		n = set_optpolines_to_learn(OP_STATE_UNSTABLE,
+			    min_t(unsigned int, optpolines.pending_relearn,
+						OP_REENABLE_IN_EPOCH));
+		optpolines.pending_relearn -= n;
+	}
+
+	if (n > 0)
+		update_optpolines();
+	return n;
+}
+
+static void optpolines_autolearn(void)
+{
+	if (relearn_pending() > 0) {
+		optpolines.sample_time = ktime_get();
+		return;
+	}
+
+	if (ktime_ms_delta(ktime_get(), optpolines.sample_time) < OP_SAMPLE_MSECS)
+		return;
+
+	/* Start another training period */
+	optpolines.pending_relearn = optpolines.list[OP_STATE_UNSTABLE].num;
+}
+
+static void optpoline_work(struct work_struct *work)
+{
+	struct optpoline *op, *tmp;
+	bool enabled;
+
+	mutex_lock(&optpolines.mutex);
+	cpus_read_lock();
+
+	enabled = optpolines.enabled;
+
+	/*
+	 * If something went wrong and the optpolines were disabled, we need to
+	 * bail out.
+	 */
+	if (unlikely(!enabled))
+		goto out;
+
+	/* Pick up the new samples */
+	process_optpoline_samples();
+
+	/*
+	 * For all those indirect branches that had a target before, but got no
+	 * samples, just set the target that we had before.
+	 */
+	list_for_each_entry_safe(op, tmp,
+				 &optpolines.list[OP_STATE_RELEARN].list, node)
+		change_optpoline_state(op, OP_STATE_UPDATING);
+
+	if (!list_empty(&optpolines.list[OP_STATE_UPDATING].list)) {
+		update_optpolines();
+		clear_optpoline_samples();
+	} else
+		optpolines_autolearn();
+
+out:
+	cpus_read_unlock();
+	mutex_unlock(&optpolines.mutex);
+
+	if (likely(enabled))
+		schedule_delayed_work(&c_work, HZ * OP_READJUST_SECONDS);
+}
+
+static void reset_optpolines(enum optpoline_state state,
+			     const struct module *mod)
+{
+	struct list_head *list = &optpolines.list[state].list;
+	struct optpoline *op, *tmp;
+
+	list_for_each_entry_safe(op, tmp, list, node) {
+		if (mod && !within_module((uintptr_t)kernel_ptr(op->target), mod))
+			continue;
+
+		op->unstable = false;
+		op->has_target = false;
+		op->to_learn = false;
+		change_optpoline_state(op, OP_STATE_UPDATING);
+	}
+}
+
+#ifdef CONFIG_MODULES
+/*
+ * reset_optpoline_module_targets() - reset optpolines with module targets
+ *
+ * @mod:	the module which optpolines that point to are removed
+ *
+ * Remove optpolines whose target is in the module as a safety precaution
+ * against speculative execution that will jump to the target.
+ */
+static void reset_optpoline_module_targets(const struct module *mod)
+{
+	enum optpoline_state state;
+
+	for (state = OP_STATE_W_TARGET_FIRST; state <= OP_STATE_W_TARGET_LAST; state++)
+		reset_optpolines(state, mod);
+
+	update_optpolines();
+}
+
+static int optpoline_module_notify(struct notifier_block *self,
+				   unsigned long val, void *data)
+{
+	struct module *mod = data;
+
+	mutex_lock(&optpolines.mutex);
+
+	switch (val) {
+	case MODULE_STATE_COMING:
+		add_optpolines(mod->optpoline_entries, mod->num_optpolines, mod);
+		break;
+	case MODULE_STATE_GOING:
+		/* Remove those which jump from the module source */
+		remove_module_optpolines(mod);
+
+		/* Remove those that point to the module */
+		reset_optpoline_module_targets(mod);
+
+		/* Clear the samples since they may point to the module */
+		clear_optpoline_samples();
+	}
+
+	mutex_unlock(&optpolines.mutex);
+	return 0;
+}
+#else
+static int optpoline_module_notify(struct notifier_block *self,
+				   unsigned long val, void *data)
+{
+	return 0;
+}
+#endif
+
+static struct notifier_block optpoline_module_nb = {
+	.notifier_call = optpoline_module_notify,
+	.priority = 1,
+};
+
+#ifdef CONFIG_PREEMPT
+/**
+ * optpoline_restart_rseq() - restart an optpoline due to preemption
+ *
+ * @regs: pointer to the registers of the task.
+ *
+ * This function should be called only when the kernel is preempted and when
+ * the KERNEL_RESTARTABLE_PREFIX is at the beginning of the preempted
+ * instruction.
+ */
+asmlinkage __visible void optpoline_restart_rseq(struct pt_regs *regs)
+{
+	u8 i;
+	u8 offsets[3] = {
+		offsetof(struct optpoline_code, jnz),
+		offsetof(struct optpoline_code, call),
+		offsetof(struct optpoline_code, fallback)
+	};
+
+	rcu_read_lock();
+	for (i = 0; i < ARRAY_SIZE(offsets); i++) {
+		unsigned long rip = regs->ip + offsets[i];
+		struct optpoline *op;
+
+		op = rhashtable_lookup(&optpolines.rhead, &rip,
+				       optpoline_rht_params);
+
+		if (op) {
+			/*
+			 * We found an appropriate entry, move the pointer to
+			 * the start of the optpoline.
+			 */
+			regs->ip -= offsets[i];
+			break;
+		}
+	}
+	rcu_read_unlock();
+}
+#endif
+
+static int optpoline_debug_show(struct seq_file *f, void *offset)
+{
+	int i;
+
+	for (i = 0; i < OP_STATE_NUM; i++)
+		seq_printf(f, "%s %u\n", optpoline_state_name[i],
+			   optpolines.list[i].num);
+
+	return 0;
+}
+
+static int optpoline_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, optpoline_debug_show, inode->i_private);
+}
+
+static ssize_t optpoline_debug_write(struct file *file, const char __user *ubuf,
+				     size_t count, loff_t *ppos)
+{
+	u8 kbuf[40] = {0};
+	ssize_t ret = 0;
+	size_t len;
+
+	len = min(count, sizeof(kbuf) - 1);
+
+	if (len == 0)
+		return -EINVAL;
+
+	if (copy_from_user(kbuf, ubuf, len))
+		return -EFAULT;
+
+	kbuf[len] = '\0';
+	if (kbuf[len - 1] == '\n')
+		kbuf[len - 1] = '\0';
+
+	mutex_lock(&optpolines.mutex);
+
+	if (strcmp(kbuf, "relearn") == 0) {
+		/* Reinitiate immediate relearning of all the unstable ones */
+
+		optpolines.pending_relearn = optpolines.list[OP_STATE_UNSTABLE].num;
+	} else if (strcmp(kbuf, "reset") == 0) {
+		/* Forget everything we know */
+
+		reset_optpolines(OP_STATE_UNSTABLE, NULL);
+		reset_optpolines(OP_STATE_RELEARN, NULL);
+		optpolines.pending_relearn = 0;
+		update_optpolines();
+	} else
+		ret = -EINVAL;
+
+	mutex_unlock(&optpolines.mutex);
+
+	return ret < 0 ? ret : count;
+}
+
+static const struct file_operations optpoline_debug_fops = {
+	.owner		= THIS_MODULE,
+	.open		= optpoline_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= optpoline_debug_write,
+};
+
+static int __init create_optpoline_debugfs(void)
+{
+	int r;
+
+	if (!IS_ENABLED(CONFIG_DEBUG_FS))
+		return 0;
+
+	optpolines.dbg_entry = debugfs_create_file("optpolines", 0600, NULL,
+						  NULL, &optpoline_debug_fops);
+	if (IS_ERR(optpolines.dbg_entry)) {
+		r = PTR_ERR(optpolines.dbg_entry);
+		pr_err("failed to create debugfs entry, error: %d", r);
+		return r;
+	}
+
+	return 0;
+}
+
+static void optpolines_destroy(void)
+{
+	kvfree(optpolines.samples_copy);
+	optpolines.samples_copy = NULL;
+
+	unregister_module_notifier(&optpoline_module_nb);
+
+	if (optpolines.rhead_initialized)
+		rhashtable_destroy(&optpolines.rhead);
+	optpolines.rhead_initialized = false;
+
+	kvfree(optpolines.kernel_optpolines);
+	optpolines.kernel_optpolines = NULL;
+
+	optpolines.enabled = false;
+}
+
+static int optpoline_realloc_samples_copy(unsigned int n_cpus)
+{
+	kvfree(optpolines.samples_copy);
+
+	optpolines.samples_copy = kvmalloc_array(n_cpus * OPTPOLINE_SAMPLES_NUM,
+					sizeof(*optpolines.samples_copy),
+					GFP_KERNEL);
+
+	if (optpolines.samples_copy == NULL) {
+		/*
+		 * Indicate for the worker to stop all work on the next
+		 * iteration. We will not break anything, but we must disable
+		 * all optpolines.
+		 */
+		WARN(1, "error allocating optpoline memory");
+		optpolines_destroy();
+	}
+
+	return 0;
+}
+
+static int optpoline_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+	unsigned int n_cpus = num_online_cpus() + !cpu_online(cpu);
+
+	if (!optpolines.enabled)
+		return 0;
+
+	return optpoline_realloc_samples_copy(n_cpus);
+}
+
+static int optpoline_cpu_prep_down(unsigned int cpu, struct hlist_node *node)
+{
+	unsigned int n_cpus = num_online_cpus() - !!cpu_online(cpu);
+
+	if (!optpolines.enabled)
+		return 0;
+
+	return optpoline_realloc_samples_copy(n_cpus);
+}
+
+static int __init optpolines_init(void)
+{
+	int i, r;
+
+	mutex_init(&optpolines.mutex);
+
+	r = rhashtable_init(&optpolines.rhead, &optpoline_rht_params);
+	if (r)
+		goto error;
+
+	optpolines.rhead_initialized = true;
+
+	if (IS_ENABLED(CONFIG_DEBUG_FS)) {
+		r = create_optpoline_debugfs();
+		if (r)
+			goto error;
+	}
+
+	optpolines.sample_time = ktime_get();
+
+	r = optpoline_realloc_samples_copy(num_online_cpus());
+	if (r)
+		goto error;
+
+	r = register_module_notifier(&optpoline_module_nb);
+	if (r) {
+		WARN(1, "error initializing optpolines");
+		goto error;
+	}
+
+	for (i = 0; i < OP_STATE_NUM; i++) {
+		INIT_LIST_HEAD(&optpolines.list[i].list);
+		optpolines.list[i].num = 0;
+	}
+
+	/*
+	 * Errors are ignored here; at worst, only some of the optpolines
+	 * would be enabled.
+	 */
+	add_optpolines(__optpolines, __optpolines_end - __optpolines, NULL);
+
+	r = cpuhp_setup_state_multi(CPUHP_AP_X86_OPTPOLINE_CHANGE,
+				    "optpoline:online", optpoline_cpu_online,
+				    optpoline_cpu_prep_down);
+	if (r)
+		goto error;
+
+	optpolines.enabled = true;
+	schedule_delayed_work(&c_work, HZ * OP_READJUST_SECONDS * 10);
+	return 0;
+error:
+	optpolines_destroy();
+	return r;
+}
+late_initcall(optpolines_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index e0cd2baa8380..4caf4d5941db 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -167,6 +167,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE,
 	CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
 	CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
+	CPUHP_AP_X86_OPTPOLINE_CHANGE,
 	CPUHP_AP_WATCHDOG_ONLINE,
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
-- 
2.17.1



* [RFC v2 6/6] x86: outline optpoline
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
                   ` (4 preceding siblings ...)
  2018-12-31  7:21 ` [RFC v2 5/6] x86: learning and patching indirect branch targets Nadav Amit
@ 2018-12-31  7:21 ` Nadav Amit
  2018-12-31 19:51 ` [RFC v2 0/6] x86: dynamic indirect branch promotion Andy Lutomirski
  2019-01-03 22:18 ` Andi Kleen
  7 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31  7:21 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree
  Cc: H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse, Nadav Amit

When there is more than a single target, we can set up an outline block
that holds optimized code for a few more targets. This is done
dynamically during runtime, limiting the potential memory consumption.

If we are preempted while running the outline block, we jump to the
indirect thunk and continue execution from there.

The current version does not reclaim memory when an entire page of
outline optpoline blocks is released (e.g., due to module removal).
Various additional optimizations are possible to reduce the memory
consumption of each optpoline.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/nospec-branch.c | 372 ++++++++++++++++++++++++++++----
 1 file changed, 331 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
index 1503c312f715..df787a00a51a 100644
--- a/arch/x86/kernel/nospec-branch.c
+++ b/arch/x86/kernel/nospec-branch.c
@@ -18,6 +18,8 @@
 #include <linux/debugfs.h>
 #include <linux/jump_label.h>
 #include <linux/rhashtable.h>
+#include <linux/moduleloader.h>
+#include <linux/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
 #include <asm/asm-offsets.h>
@@ -32,11 +34,16 @@
 #define NOP_OPCODE		(0x90)
 #define INT3_OPCODE		(0xcc)
 #define CALL_IND_INS		"\xff\xd0"
+#define JMP_REL32_OPCODE        (0xe9)
+#define JZ_REL32_OPCODE		"\x0f\x84"
 
 #define OP_READJUST_SECONDS	(1)
 #define OP_REENABLE_IN_EPOCH	(4)
 #define OP_ENABLE_IN_EPOCH	(512)
 #define OP_SAMPLE_MSECS		(30000)
+#define OP_INLINED_TARGETS	(1)
+#define OP_OUTLINED_TARGETS	(6)
+#define OP_N_TARGETS		(OP_INLINED_TARGETS + OP_OUTLINED_TARGETS)
 
 enum code_state {
 	OPTPOLINE_SLOWPATH,
@@ -46,12 +53,25 @@ enum code_state {
 
 /*
  * Information for optpolines as it dynamically changes during execution.
+ *
+ * TODO: This structure is under-optimized. For example, we can reduce its size
+ * and not save the targets, but instead realize them from the code itself.
  */
 struct optpoline {
 	struct list_head node;
 	struct rhash_head rhash;
+
+	/*
+	 * Using hashtable to look for the optpoline by the instruction-pointer.
+	 */
+	struct rhash_head outline_rhash;
+
+	/* Outline code block; NULL if none */
+	union outline_optpoline *outline_op;
+
 	u32 rip;
-	u32 target;
+	u32 outline_rip;
+	u32 targets[OP_N_TARGETS];
 
 	/* The register that is used by the indirect branch */
 	u8 reg : 4;
@@ -59,12 +79,6 @@ struct optpoline {
 	/* The state of the optpoline, which indicates on which list it is */
 	u8 state : 3;
 
-	/* Whether we ever encountered more than one target */
-	u8 unstable : 1;
-
-	/* Whether there is a valid target */
-	u8 has_target : 1;
-
 	/*
 	 * Whether the optpoline needs to be set into training mode. This is a
 	 * transitory indication, which is cleared once it is set in learning
@@ -87,8 +101,16 @@ static const struct rhashtable_params optpoline_rht_params = {
 	.head_offset = offsetof(struct optpoline, rhash),
 };
 
+static const struct rhashtable_params outline_optpoline_rht_params = {
+	.automatic_shrinking = true,
+	.key_len = sizeof(u32),
+	.key_offset = offsetof(struct optpoline, outline_rip),
+	.head_offset = offsetof(struct optpoline, outline_rhash),
+};
+
+
 DEFINE_PER_CPU_ALIGNED(struct optpoline_sample[OPTPOLINE_SAMPLES_NUM],
-		       optpoline_samples);
+			optpoline_samples);
 DEFINE_PER_CPU(u8, has_optpoline_samples);
 
 enum optpoline_state {
@@ -114,6 +136,38 @@ static const char *const optpoline_state_name[OP_STATE_NUM] = {
 	[OP_STATE_UPDATING] = "updating",
 };
 
+union outline_optpoline_entry {
+	struct {
+		struct {
+			u8 no_preempt_prefix;
+			u8 rex_prefix;
+			u8 opcode;
+			u8 modrm;
+			u32 imm;
+		} __packed cmp;
+		struct {
+			u8 no_preempt_prefix;
+			u8 opcode[2];
+			int32_t rel;
+		} __packed jz;
+	} __packed;
+	struct {
+		u8 no_preempt_prefix;
+		u8 opcode;
+		int32_t rel;
+	} __packed fallback;
+} __packed;
+
+#define OP_OUTLINE_SIZE							\
+	(ALIGN(sizeof(union outline_optpoline_entry) *			\
+		(OP_OUTLINED_TARGETS + 1), SMP_CACHE_BYTES))
+
+union outline_optpoline {
+	union outline_optpoline_entry entries[OP_OUTLINED_TARGETS + 1];
+	union outline_optpoline *next_free;
+	u8 padding[OP_OUTLINE_SIZE];
+} __packed;
+
 struct optpolines {
 	struct mutex mutex;
 	struct optpoline_sample *samples_copy;
@@ -124,6 +178,12 @@ struct optpolines {
 	 */
 	struct rhashtable rhead;
 
+	/*
+	 * Hashtable that holds all the outlined optpolines, keyed by the
+	 * instruction pointer of the outline optpoline code block, aligned.
+	 */
+	struct rhashtable outline_rhead;
+
 	/*
 	 * List of optpolines according to their states.
 	 */
@@ -140,6 +200,7 @@ struct optpolines {
 	 * wasn't initialized, so we need to keep track of it.
 	 */
 	u8 rhead_initialized : 1;
+	u8 outline_rhead_initialized : 1;
 
 	 /* Number of unstable optpolines that are pending relearning. */
 	unsigned int pending_relearn;
@@ -147,11 +208,18 @@ struct optpolines {
 	/* Kernel (excluding modules) optpolines */
 	struct optpoline *kernel_optpolines;
 
+	union outline_optpoline *outline_free_list;
+
 	struct dentry *dbg_entry;
 
 	ktime_t sample_time;
 };
 
+struct optpoline_target {
+	u32 target;
+	u32 count;
+};
+
 static struct optpolines optpolines;
 
 static inline void *kernel_ptr(u32 low_addr)
@@ -169,6 +237,149 @@ static inline u8 optpoline_cmp_modrm(u8 reg)
 	return 0xf8 | reg;
 }
 
+static void release_outline_branch_block(struct optpoline *op)
+{
+	union outline_optpoline *p_outline = kernel_ptr(op->outline_rip);
+
+	if (!p_outline)
+		return;
+
+	/*
+	 * Update the linked-list, but since these are code pages, we need to do
+	 * it by poking the text.
+	 */
+	text_poke(&p_outline->next_free, &optpolines.outline_free_list,
+		  sizeof(p_outline->next_free));
+	optpolines.outline_free_list = p_outline;
+
+	rhashtable_remove_fast(&optpolines.outline_rhead, &op->outline_rhash,
+			       outline_optpoline_rht_params);
+
+	op->outline_rip = 0;
+
+	/* TODO: release the whole page if its reference count drops to zero */
+}
+
+static void alloc_outline_branch_block(struct optpoline *op)
+{
+	const u32 entries_per_chunk = PAGE_SIZE / OP_OUTLINE_SIZE;
+	union outline_optpoline *outline, *chunk;
+	unsigned int i;
+	int r;
+
+retry:
+	outline = optpolines.outline_free_list;
+	if (outline) {
+		op->outline_rip = (u32)(uintptr_t)outline;
+		r = rhashtable_insert_fast(&optpolines.outline_rhead,
+					   &op->outline_rhash, outline_optpoline_rht_params);
+
+		if (unlikely(r < 0)) {
+			WARN_ONCE(1, "outline block hashtable insertion failed");
+			op->outline_rip = 0;
+			return;
+		}
+
+		optpolines.outline_free_list = outline->next_free;
+		return;
+	}
+
+	chunk = module_alloc(PAGE_SIZE);
+	if (unlikely(!chunk)) {
+		WARN_ONCE(1, "outline block allocation failed");
+		return;
+	}
+
+	/* Add the entries from the newly allocated chunk to the linked list */
+	for (i = 1; i < entries_per_chunk; i++)
+		chunk[i - 1].next_free = &chunk[i];
+
+	chunk[entries_per_chunk - 1].next_free = optpolines.outline_free_list;
+	optpolines.outline_free_list = &chunk[0];
+
+	set_memory_ro((uintptr_t)chunk, 1);
+	goto retry;
+}
+
+static int make_outline_optpoline(struct optpoline *op)
+{
+	union outline_optpoline outline_op, *p_outline;
+	union outline_optpoline_entry *entry;
+	uintptr_t rel_from, rel_to;
+	size_t set_bytes;
+	unsigned int i;
+	u8 *last_byte;
+	s64 offset;
+
+	if (!op->outline_rip) {
+		alloc_outline_branch_block(op);
+
+		/*
+		 * If we hit some allocation error, only use the first
+		 * destination and override the previous decision.
+		 */
+		if (unlikely(!op->outline_rip))
+			return -ENOMEM;
+	}
+
+	p_outline = kernel_ptr(op->outline_rip);
+
+	/* Offset to adjust the pointer before poking */
+	offset = (uintptr_t)p_outline - (uintptr_t)&outline_op;
+
+	for (i = 0, entry = &outline_op.entries[0]; i < OP_OUTLINED_TARGETS;
+	     i++, entry++) {
+		u32 target = op->targets[i + OP_INLINED_TARGETS];
+
+		if (target == 0)
+			break;
+
+		/* compare target to immediate */
+		entry->cmp.no_preempt_prefix = KERNEL_RESTARTABLE_PREFIX;
+		entry->cmp.opcode = CMP_IMM32_OPCODE;
+		entry->cmp.imm = target;
+		entry->cmp.rex_prefix = optpoline_cmp_rex_prefix(op->reg);
+		entry->cmp.modrm = optpoline_cmp_modrm(op->reg);
+
+		/*
+		 * During installation of the outline optpoline, we set all
+		 * the entries as JMPs, and patch them into JZ only once we
+		 * enable them.
+		 */
+		entry->jz.no_preempt_prefix = KERNEL_RESTARTABLE_PREFIX;
+		memcpy(entry->jz.opcode, JZ_REL32_OPCODE, sizeof(entry->jz.opcode));
+
+		rel_from = (uintptr_t)&entry->jz + sizeof(entry->jz) + offset;
+		rel_to = (uintptr_t)kernel_ptr(target);
+		entry->jz.rel = (s32)(rel_to - rel_from);
+	}
+
+	entry->fallback.no_preempt_prefix = KERNEL_RESTARTABLE_PREFIX;
+	entry->fallback.opcode = JMP_REL32_OPCODE;
+
+	rel_from = (uintptr_t)&entry->fallback + sizeof(entry->fallback) + offset;
+
+	/*
+	 * If we ran out of space, stop learning, since it would thrash the
+	 * learning hash-table.
+	 */
+	if (entry == &outline_op.entries[OP_OUTLINED_TARGETS])
+		rel_to = (uintptr_t)indirect_thunks[op->reg];
+	else
+		rel_to = (uintptr_t)save_optpoline_funcs[op->reg];
+
+	entry->fallback.rel = (s32)(rel_to - rel_from);
+
+	/* Fill the rest with int3 */
+	last_byte = (u8 *)&entry->fallback + sizeof(entry->fallback);
+	set_bytes = (uintptr_t)last_byte - (uintptr_t)&outline_op;
+	memset(last_byte, INT3_OPCODE, OP_OUTLINE_SIZE - set_bytes);
+
+	text_poke(p_outline, &outline_op, sizeof(outline_op));
+
+	return 0;
+}
+
 static void clear_optpoline_samples(void)
 {
 	int cpu, i;
@@ -264,6 +475,7 @@ static void remove_module_optpolines(struct module *mod)
 			continue;
 
 		remove_optpoline_from_list(op);
+		release_outline_branch_block(op);
 
 		rhashtable_remove_fast(&optpolines.rhead, &op->rhash,
 				       optpoline_rht_params);
@@ -358,31 +570,44 @@ static void analyze_optpoline_samples(struct optpoline *op,
 				      struct optpoline_sample *samples,
 				      unsigned int n_entries)
 {
-	int i, n_targets = 0;
+	unsigned int i, j, k, n_new_targets, n_prev_targets;
+
+	/* Count the number of previous targets */
+	for (n_prev_targets = 0; n_prev_targets < OP_N_TARGETS; n_prev_targets++) {
+		if (op->targets[n_prev_targets] == 0)
+			break;
+	}
 
 	/* Merge destinations with the same source and sum their samples. */
-	for (i = 0; i < n_entries; i++) {
-		if (n_targets == 0 ||
-		    samples[n_targets-1].tgt != samples[i].tgt) {
+	for (i = 0, n_new_targets = 0; i < n_entries; i++) {
+		if (n_new_targets == 0 ||
+		    samples[n_new_targets-1].tgt != samples[i].tgt) {
 			/* New target */
-			samples[n_targets++] = samples[i];
+			samples[n_new_targets++] = samples[i];
 			continue;
 		}
 
 		/* Known target, add samples */
-		samples[n_targets - 1].cnt += samples[i].cnt;
+		samples[n_new_targets - 1].cnt += samples[i].cnt;
 	}
 
 	/* Sort targets by frequency */
-	sort(samples, n_targets, sizeof(*samples),
+	sort(samples, n_new_targets, sizeof(*samples),
 	     optpoline_sample_cnt_cmp_func, NULL);
 
-	/* Mark as unstable if there is more than a single target */
-	if ((op->has_target && op->target != samples[0].tgt) || n_targets > 1)
-		op->unstable = true;
+	for (i = n_prev_targets, j = 0;
+	     i < OP_N_TARGETS && j < n_new_targets; j++) {
+		bool duplicate = false;
 
-	op->target = samples[0].tgt;
-	op->has_target = true;
+		/* Check if the new target is already considered */
+		for (k = 0; k < i; k++)
+			duplicate |= (samples[j].tgt == op->targets[k]);
+
+		if (duplicate)
+			continue;
+
+		op->targets[i++] = samples[j].tgt;
+	}
 }
 
 /*
@@ -435,11 +660,11 @@ static inline bool is_learning_optpoline(struct optpoline *op)
 		return true;
 
 	/* Don't get updates if we know there are multiple targets */
-	if (op->unstable)
+	if (op->targets[OP_INLINED_TARGETS])
 		return false;
 
 	/* If we still don't know what the target is, learning is needed */
-	return !op->has_target;
+	return op->targets[0] == 0;
 }
 
 static inline bool is_conditional_optpoline(struct optpoline *op)
@@ -449,7 +674,7 @@ static inline bool is_conditional_optpoline(struct optpoline *op)
 		return false;
 
 	/* If we don't know where to go, the condition is useless */
-	return op->has_target;
+	return op->targets[0];
 }
 
 #define CALL_OFFSET(op, offset, target)					\
@@ -470,13 +695,13 @@ static void make_common_optpoline(struct optpoline *op,
 }
 
 static void make_conditional_optpoline(struct optpoline *op,
-				       struct optpoline_code *code,
-				       const void *fallback_target)
+					struct optpoline_code *code,
+					const void *fallback_target)
 {
 	code->cmp.rex = optpoline_cmp_rex_prefix(op->reg);
 	code->cmp.opcode = CMP_IMM32_OPCODE;
 	code->cmp.modrm = optpoline_cmp_modrm(op->reg);
-	code->cmp.imm = op->target;
+	code->cmp.imm = op->targets[0];
 
 	code->jnz.rex = KERNEL_RESTARTABLE_PREFIX;
 	code->jnz.opcode = JNZ_REL8_OPCODE;
@@ -485,7 +710,7 @@ static void make_conditional_optpoline(struct optpoline *op,
 
 	code->call.rex = KERNEL_RESTARTABLE_PREFIX;
 	code->call.opcode = CALL_REL32_OPCODE;
-	code->call.rel = CALL_OFFSET(op, call, kernel_ptr(op->target));
+	code->call.rel = CALL_OFFSET(op, call, kernel_ptr(op->targets[0]));
 
 	make_common_optpoline(op, code, fallback_target);
 }
@@ -525,7 +750,7 @@ static void update_optpolines(void)
 {
 	struct list_head *list = &optpolines.list[OP_STATE_UPDATING].list;
 	const u8 patching_offset = offsetofend(struct optpoline_code,
-					       patching_call);
+						patching_call);
 	struct optpoline *op, *tmp;
 
 	mutex_lock(&text_mutex);
@@ -537,10 +762,6 @@ static void update_optpolines(void)
 
 		p_code = kernel_ptr(op->rip);
 
-		fallback_target = (is_learning_optpoline(op)) ?
-					save_optpoline_funcs[op->reg] :
-					indirect_thunks[op->reg];
-
 		end_of_optpoline = (const void *)(p_code + 1);
 
 		/*
@@ -564,6 +785,25 @@ static void update_optpolines(void)
 		/* Wait for everyone to see the updated version */
 		synchronize_sched();
 
+		if (op->targets[OP_INLINED_TARGETS]) {
+			int r = make_outline_optpoline(op);
+
+			/*
+			 * If there is an error in creating the outline, fall
+			 * back to a single target.
+			 */
+			if (r)
+				op->targets[OP_INLINED_TARGETS] = 0;
+		}
+
+		/* Choose the appropriate target */
+		if (is_learning_optpoline(op))
+			fallback_target = save_optpoline_funcs[op->reg];
+		else if (op->outline_rip)
+			fallback_target = kernel_ptr(op->outline_rip);
+		else
+			fallback_target = indirect_thunks[op->reg];
+
 		if (is_conditional_optpoline(op))
 			make_conditional_optpoline(op, &code, fallback_target);
 		else
@@ -578,9 +818,10 @@ static void update_optpolines(void)
 			     &p_code->fallback);
 
 		if (is_conditional_optpoline(op)) {
-			state = op->unstable ? OP_STATE_UNSTABLE :
-					       OP_STATE_STABLE;
-		} else if (op->has_target)
+			state = op->targets[OP_INLINED_TARGETS] ?
+						OP_STATE_UNSTABLE :
+						OP_STATE_STABLE;
+		} else if (op->targets[0])
 			state = OP_STATE_RELEARN;
 		else
 			state = op->to_learn ? OP_STATE_LEARN : OP_STATE_NEW;
@@ -694,6 +935,19 @@ static void optpoline_work(struct work_struct *work)
 		schedule_delayed_work(&c_work, HZ * OP_READJUST_SECONDS);
 }
 
+static bool optpoline_has_targets_in_module(struct optpoline *op,
+					    const struct module *mod)
+{
+	unsigned int i;
+
+	for (i = 0; i < OP_N_TARGETS && op->targets[i]; i++) {
+		if (within_module((uintptr_t)kernel_ptr(op->targets[i]), mod))
+			return true;
+	}
+
+	return false;
+}
+
 static void reset_optpolines(enum optpoline_state state,
 			     const struct module *mod)
 {
@@ -701,13 +955,20 @@ static void reset_optpolines(enum optpoline_state state,
 	struct optpoline *op, *tmp;
 
 	list_for_each_entry_safe(op, tmp, list, node) {
-		if (mod && !within_module((uintptr_t)kernel_ptr(op->target), mod))
+		if (mod && !optpoline_has_targets_in_module(op, mod))
 			continue;
 
-		op->unstable = false;
-		op->has_target = false;
+		op->targets[0] = 0;
 		op->to_learn = false;
 		change_optpoline_state(op, OP_STATE_UPDATING);
+
+		/*
+		 * TODO: To be super safe, we should disable the optpolines
+		 * immediately instead of deferring the update. Otherwise, there
+		 * is a time-window in which the optpoline can speculatively
+		 * jump to the wrong target. Someone may, for example, use BPF
+		 * to place their own code at the call target and exploit it.
+		 */
 	}
 }
 
@@ -778,6 +1039,8 @@ static struct notifier_block optpoline_module_nb = {
  */
 asmlinkage __visible void optpoline_restart_rseq(struct pt_regs *regs)
 {
+	u32 outline_start;
+	struct optpoline *op;
 	u8 i;
 	u8 offsets[3] = {
 		offsetof(struct optpoline_code, jnz),
@@ -786,12 +1049,13 @@ asmlinkage __visible void optpoline_restart_rseq(struct pt_regs *regs)
 	};
 
 	rcu_read_lock();
+
+	/* Check the inline optpoline */
 	for (i = 0; i < ARRAY_SIZE(offsets); i++) {
 		unsigned long rip = regs->ip + offsets[i];
-		struct optpoline *op;
 
 		op = rhashtable_lookup(&optpolines.rhead, &rip,
-				       optpoline_rht_params);
+					optpoline_rht_params);
 
 		if (op) {
 			/*
@@ -799,9 +1063,25 @@ asmlinkage __visible void optpoline_restart_rseq(struct pt_regs *regs)
 			 * the start of the optpoline.
 			 */
 			regs->ip -= offsets[i];
-			break;
+			goto out;
 		}
 	};
+
+	/* Check the outline optpoline. Use the alignment to ease the search */
+	outline_start = (regs->ip & PAGE_MASK) |
+			ALIGN_DOWN(offset_in_page(regs->ip), OP_OUTLINE_SIZE);
+
+	op = rhashtable_lookup(&optpolines.outline_rhead, &outline_start,
+			       outline_optpoline_rht_params);
+
+	/*
+	 * We already have performed the call and we don't want to mess with the
+	 * stack, so just move the rip to the indirect call thunk.
+	 */
+	if (op)
+		regs->ip = (uintptr_t)indirect_thunks[op->reg];
+
+out:
 	rcu_read_unlock();
 }
 #endif
@@ -900,6 +1180,10 @@ static void optpolines_destroy(void)
 		rhashtable_destroy(&optpolines.rhead);
 	optpolines.rhead_initialized = false;
 
+	if (optpolines.outline_rhead_initialized)
+		rhashtable_destroy(&optpolines.outline_rhead);
+	optpolines.outline_rhead_initialized = false;
+
 	kvfree(optpolines.kernel_optpolines);
 	optpolines.kernel_optpolines = NULL;
 
@@ -959,6 +1243,12 @@ static int __init optpolines_init(void)
 
 	optpolines.rhead_initialized = true;
 
+	r = rhashtable_init(&optpolines.outline_rhead, &outline_optpoline_rht_params);
+	if (r)
+		goto error;
+
+	optpolines.outline_rhead_initialized = true;
+
 	if (IS_ENABLED(CONFIG_DEBUG_FS)) {
 		r = create_optpoline_debugfs();
 		if (r)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
                   ` (5 preceding siblings ...)
  2018-12-31  7:21 ` [RFC v2 6/6] x86: outline optpoline Nadav Amit
@ 2018-12-31 19:51 ` Andy Lutomirski
  2018-12-31 19:53   ` Nadav Amit
  2019-01-03 22:18 ` Andi Kleen
  7 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2018-12-31 19:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse

On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>
> This is a revised version of optpolines (formerly named retpolines) for
> dynamic indirect branch promotion in order to reduce retpoline overheads
> [1].

Some of your changelogs still call them "relpolines".

I have a crazy suggestion: maybe don't give them a cute name at all?
Where it's actually necessary to name them (like in a config option),
use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
CONFIG_PATCH_INDIRECT_CALLS or something.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2018-12-31 19:51 ` [RFC v2 0/6] x86: dynamic indirect branch promotion Andy Lutomirski
@ 2018-12-31 19:53   ` Nadav Amit
  2019-01-03 18:10     ` Josh Poimboeuf
  0 siblings, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2018-12-31 19:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, Josh Poimboeuf, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>> This is a revised version of optpolines (formerly named retpolines) for
>> dynamic indirect branch promotion in order to reduce retpoline overheads
>> [1].
> 
> Some of your changelogs still call them "relpolines".
> 
> I have a crazy suggestion: maybe don't give them a cute name at all?
> Where it's actually necessary to name them (like in a config option),
> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> CONFIG_PATCH_INDIRECT_CALLS or something.

I’m totally fine with that (don’t turn me into a "marketing” guy). It was
just a way to refer to the mechanism. I need more feedback about the more
fundamental issues to go on.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/6] x86: learning and patching indirect branch targets
  2018-12-31  7:21 ` [RFC v2 5/6] x86: learning and patching indirect branch targets Nadav Amit
@ 2018-12-31 20:05   ` Andy Lutomirski
  2018-12-31 21:07     ` Nadav Amit
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2018-12-31 20:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse

On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>
> During runtime, we collect the targets of indirect branch targets and
> patch them in. Patching is done asynchronously, by modifying each of the
> relpoline code-paths separately while diverting code execution to the
> other path during patching. Preemption is disabled while the code runs,
> and we wait for preemption to occur on each core to ensure no core is
> executing the patched code.
>
> To make use of relpolines, a worker goes over the experienced indirect
> calls targets and sorts them according to frequency. The target that
> was encountered most times is patched in.
>
> Periodically, the indirect branches are set back into learning mode to
> see whether the targets have changed. The current policy might be too
> aggressive.
>

Can you put, in a comment somewhere, a clear description of the actual
optpoline assembly sequence?  I'm finding this code very hard to
follow as is.  Something like:

/*
 * An optpoline is:
 *
* cmp something, something else
* je somewhere
* [repeats of the above]
* RETPOLINE (i.e. call some thunk)
*/

And please make it correct.

Your comment says that preemption is disabled, but it's not obvious to
me where this happens.

Also, you define REX_B and don't use it.  Are there other cases of that?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
@ 2018-12-31 20:08   ` Andy Lutomirski
  2018-12-31 21:12     ` Nadav Amit
  2019-01-03 22:21   ` Andi Kleen
  2019-01-04  0:34   ` hpa
  2 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2018-12-31 20:08 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse

On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>
> It is sometimes beneficial to have a restartable sequence - very few
> instructions which if they are preempted jump to a predefined point.
>
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication for instruction in such a sequence. Before
> calling the schedule IRQ routine, if the "magic" prefix is found, we
> call a routine to adjust the instruction pointer.  It is expected that
> this opcode is not in common use.
>
> The following patch will make use of this function. Since there are no
> other users (yet?), the patch does not bother to create a general
> infrastructure and API that others can use for such sequences. Yet, it
> should not be hard to make such extension later.

The following patch does not use it.  Can you update this?


> +asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs)
> +{
> +       if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX)
> +               return;

else?

I suspect something is missing here.  Or I'm very confused.

> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/6] x86: learning and patching indirect branch targets
  2018-12-31 20:05   ` Andy Lutomirski
@ 2018-12-31 21:07     ` Nadav Amit
  0 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, Josh Poimboeuf, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> On Dec 31, 2018, at 12:05 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>> During runtime, we collect the targets of indirect branch targets and
>> patch them in. Patching is done asynchronously, by modifying each of the
>> relpoline code-paths separately while diverting code execution to the
>> other path during patching. Preemption is disabled while the code runs,
>> and we wait for preemption to occur on each core to ensure no core is
>> executing the patched code.
>> 
>> To make use of relpolines, a worker goes over the experienced indirect
>> calls targets and sorts them according to frequency. The target that
>> was encountered most times is patched in.
>> 
>> Periodically, the indirect branches are set back into learning mode to
>> see whether the targets have changed. The current policy might be too
>> aggressive.
> 
> Can you put, in a comment somewhere, a clear description of the actual
> optpoline assembly sequence?  I'm finding this code very hard to
> follow as is.  Something like:
> 
> /*
> * An optpoline is:
> *
> * cmp something, something else
> * je somewhere
> * [repeats of the above]
> * RETPOLINE (i.e. call some thunk)
> */
> 

Sure. I will add it. The GCC plugin code [3/6] holds commented assembly
code, but I will add it to the commit log as well.

> And please make it correct.
> 
> Your comment says that preemption is disabled, but it's not obvious to
> me where this happens.
> 
> Also, you define REX_B and don't use it.  Are there other cases of that?

Yes, I was sloppy. The preemption is not disabled, and instead I used your
proposed approach of restartable sequences.

REX_B is used as KERNEL_RESTARTABLE_PREFIX in [3/6], [5/6] and [6/6]. I will
rename it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2018-12-31 20:08   ` Andy Lutomirski
@ 2018-12-31 21:12     ` Nadav Amit
  0 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2018-12-31 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, Josh Poimboeuf, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> On Dec 31, 2018, at 12:08 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>> It is sometimes beneficial to have a restartable sequence - very few
>> instructions which if they are preempted jump to a predefined point.
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication for instruction in such a sequence. Before
>> calling the schedule IRQ routine, if the "magic" prefix is found, we
>> call a routine to adjust the instruction pointer.  It is expected that
>> this opcode is not in common use.
>> 
>> The following patch will make use of this function. Since there are no
>> other users (yet?), the patch does not bother to create a general
>> infrastructure and API that others can use for such sequences. Yet, it
>> should not be hard to make such extension later.
> 
> The following patch does not use it.  Can you update this?

I will. Sorry for not updating the commit-log. The GCC plugin, and various
requests (that I am not sure I fully agree with) really caused me to spit
some blood.

>> +asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs)
>> +{
>> +       if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX)
>> +               return;
> 
> else?
> 
> I suspect something is missing here.  Or I'm very confused.

Indeed, the code should call optpoline_restart_rseq() (or some other name,
once I fix the naming). I will fix it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2018-12-31 19:53   ` Nadav Amit
@ 2019-01-03 18:10     ` Josh Poimboeuf
  2019-01-03 18:30       ` Nadav Amit
  0 siblings, 1 reply; 42+ messages in thread
From: Josh Poimboeuf @ 2019-01-03 18:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Ingo Molnar, Peter Zijlstra, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

On Mon, Dec 31, 2018 at 07:53:06PM +0000, Nadav Amit wrote:
> > On Dec 31, 2018, at 11:51 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > 
> > On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
> >> This is a revised version of optpolines (formerly named retpolines) for
> >> dynamic indirect branch promotion in order to reduce retpoline overheads
> >> [1].
> > 
> > Some of your changelogs still call them "relpolines".
> > 
> > I have a crazy suggestion: maybe don't give them a cute name at all?
> > Where it's actually necessary to name them (like in a config option),
> > use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> > CONFIG_PATCH_INDIRECT_CALLS or something.

Cute or not, naming is important.

If you want a description instead of a name, it will be a challenge to
describe it in 2-3 words.

I have no idea what "dynamic devirtualization" means.

"Patch indirect calls" doesn't fully describe it either (and could be
easily confused with static calls and some other approaches).

> I’m totally fine with that (don’t turn me into a "marketing” guy). It was
> just a way to refer to the mechanism. I need more feedback about the more
> fundamental issues to go on.

Naming isn't marketing.  It's a real issue: it affects both usability
and code readability.

-- 
Josh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-03 18:10     ` Josh Poimboeuf
@ 2019-01-03 18:30       ` Nadav Amit
  2019-01-03 20:31         ` Josh Poimboeuf
  0 siblings, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2019-01-03 18:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Ingo Molnar, Peter Zijlstra, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse, Ard Biesheuvel

> On Jan 3, 2019, at 10:10 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Mon, Dec 31, 2018 at 07:53:06PM +0000, Nadav Amit wrote:
>>> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
>>>> This is a revised version of optpolines (formerly named retpolines) for
>>>> dynamic indirect branch promotion in order to reduce retpoline overheads
>>>> [1].
>>> 
>>> Some of your changelogs still call them "relpolines".
>>> 
>>> I have a crazy suggestion: maybe don't give them a cute name at all?
>>> Where it's actually necessary to name them (like in a config option),
>>> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
>>> CONFIG_PATCH_INDIRECT_CALLS or something.
> 
> Cute or not, naming is important.
> 
> If you want a description instead of a name, it will be a challenge to
> describe it in 2-3 words.
> 
> I have no idea what "dynamic devirtualization" means.
> 
> "Patch indirect calls" doesn't fully describe it either (and could be
> easily confused with static calls and some other approaches).
> 
>> I’m totally fine with that (don’t turn me into a "marketing” guy). It was
>> just a way to refer to the mechanism. I need more feedback about the more
>> fundamental issues to go on.
> 
> Naming isn't marketing.  It's a real issue: it affects both usability
> and code readability.

Well, allow me to be on the fence on this one.

I look for the path of least resistance. I think it would be easiest if I
first manage to make Josh’s static calls to be less intrusive. For that, I
try to add in the GCC plugin an attribute to annotate the function pointers
whose calls should be promoted. However, I have not managed to get the
declaration from the call instruction's rtx. If anyone has a pointer on how
it can be done, that would be very helpful.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-03 18:30       ` Nadav Amit
@ 2019-01-03 20:31         ` Josh Poimboeuf
  0 siblings, 0 replies; 42+ messages in thread
From: Josh Poimboeuf @ 2019-01-03 20:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Ingo Molnar, Peter Zijlstra, Edward Cree,
	H . Peter Anvin, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse, Ard Biesheuvel

On Thu, Jan 03, 2019 at 06:30:08PM +0000, Nadav Amit wrote:
> > On Jan 3, 2019, at 10:10 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Mon, Dec 31, 2018 at 07:53:06PM +0000, Nadav Amit wrote:
> >>> On Dec 31, 2018, at 11:51 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >>> 
> >>> On Sun, Dec 30, 2018 at 11:20 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> This is a revised version of optpolines (formerly named retpolines) for
> >>>> dynamic indirect branch promotion in order to reduce retpoline overheads
> >>>> [1].
> >>> 
> >>> Some of your changelogs still call them "relpolines".
> >>> 
> >>> I have a crazy suggestion: maybe don't give them a cute name at all?
> >>> Where it's actually necessary to name them (like in a config option),
> >>> use a description like CONFIG_DYNAMIC_DEVIRTUALIZATION or
> >>> CONFIG_PATCH_INDIRECT_CALLS or something.
> > 
> > Cute or not, naming is important.
> > 
> > If you want a description instead of a name, it will be a challenge to
> > describe it in 2-3 words.
> > 
> > I have no idea what "dynamic devirtualization" means.
> > 
> > "Patch indirect calls" doesn't fully describe it either (and could be
> > easily confused with static calls and some other approaches).
> > 
> >> I’m totally fine with that (don’t turn me into a “marketing” guy). It was
> >> just a way to refer to the mechanism. I need more feedback about the more
> >> fundamental issues to go on.
> > 
> > Naming isn't marketing.  It's a real issue: it affects both usability
> > and code readability.
> 
> Well, allow me to be on the fence on this one.
> 
> I am looking for the path of least resistance. I think it would be easiest
> if I first manage to make Josh’s static calls less intrusive. For that, I am
> trying to add an attribute in the GCC plugin to annotate the function
> pointers whose calls should be promoted. However, I have not managed to get
> the declaration from the call instruction's rtx. If anyone has a pointer on
> how it can be done, that would be very helpful.

Ok.  FYI, I'll be posting v3 in the next few days or so.

-- 
Josh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
                   ` (6 preceding siblings ...)
  2018-12-31 19:51 ` [RFC v2 0/6] x86: dynamic indirect branch promotion Andy Lutomirski
@ 2019-01-03 22:18 ` Andi Kleen
  2019-01-07 16:32   ` Peter Zijlstra
  7 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2019-01-03 22:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	adrian.hunter

Nadav Amit <namit@vmware.com> writes:
>
> - Do we use periodic learning or not? Josh suggested to reconfigure the
>   branches whenever a new target is found. However, I do not know at
>   this time how to do learning efficiently, without making learning much
>   more expensive.

FWIW frequent patching will likely completely break perf Processor Trace
decoding, which needs a somewhat stable kernel text image to decode the
traces generated by the CPU. Right now it relies on kcore dumped after
the trace usually being stable because jumplabel changes happen only
infrequently. But if you start patching frequently this assumption will
break.

You would either need a way to turn this off, or provide
updates for every change to the trace, so that the decoder can
keep track.

-Andi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
  2018-12-31 20:08   ` Andy Lutomirski
@ 2019-01-03 22:21   ` Andi Kleen
  2019-01-03 22:29     ` Nadav Amit
  2019-01-04  0:34   ` hpa
  2 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2019-01-03 22:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse

Nadav Amit <namit@vmware.com> writes:

I see another poor man's attempt to reinvent TSX.

> It is sometimes beneficial to have a restartable sequence - very few
> instructions which if they are preempted jump to a predefined point.
>
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication for instruction in such a sequence. Before
> calling the schedule IRQ routine, if the "magic" prefix is found, we
> call a routine to adjust the instruction pointer.  It is expected that
> this opcode is not in common use.

You cannot just assume something like that. x86 is a constantly
evolving architecture. The prefix might well have meaning at
some point.

Before doing something like that you would need to ask the CPU
vendors to reserve the sequence you're using for software use.

You're doing the equivalent of patching a private system call
into your own kernel without working with upstream, don't do that.

Better to find some other solution to do the restart.
How about simply using a per cpu variable? That should be cheaper
anyways.

-Andi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2019-01-03 22:21   ` Andi Kleen
@ 2019-01-03 22:29     ` Nadav Amit
  2019-01-03 22:48       ` Andi Kleen
  0 siblings, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2019-01-03 22:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse

> On Jan 3, 2019, at 2:21 PM, Andi Kleen <ak@linux.intel.com> wrote:
> 
> Nadav Amit <namit@vmware.com> writes:
> 
> I see another poor man's attempt to reinvent TSX.
> 
>> It is sometimes beneficial to have a restartable sequence - very few
>> instructions which if they are preempted jump to a predefined point.
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication for instruction in such a sequence. Before
>> calling the schedule IRQ routine, if the "magic" prefix is found, we
>> call a routine to adjust the instruction pointer.  It is expected that
>> this opcode is not in common use.
> 
> You cannot just assume something like that. x86 is a constantly
> evolving architecture. The prefix might well have meaning at
> some point.
> 
> Before doing something like that you would need to ask the CPU
> vendors to reserve the sequence you're using for software use.

Ok… I’ll try to think about another solution. Just note that this is just
used as a hint to avoid unnecessary lookups. (IOW, nothing will break if the
prefix is used.)

> You're doing the equivalent of patching a private system call
> into your own kernel without working with upstream, don't do that.

I don’t understand this comment though. Can you please explain?

> Better to find some other solution to do the restart.
> How about simply using a per cpu variable? That should be cheaper
> anyways.

The problem is that the per-cpu variable needs to be updated after the call
is executed, when we are already not in the context of the “injected” code.
I can increase it before the call, and decrease it after return - but this
can create (in theory) long periods in which the code is “unpatchable”,
increase the code size and slow performance.

Anyhow, I’ll give it more thought. Ideas are welcome.
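For reference, the counter-based variant described above can be sketched in
user space roughly as follows (all names are made up for illustration; a
real implementation would use a per-CPU variable and would have to check
the counter on every CPU before patching):

```c
#include <assert.h>

/* Sketch of the refcount idea discussed above: bump a counter around
 * the promoted call so the patching code knows when the call site is
 * in use.  The thread-local variable stands in for a per-CPU one. */

static _Thread_local int optpoline_busy;        /* hypothetical name */

static int (*promoted_target)(int);

static int promoted_call(int arg)
{
	int ret;

	optpoline_busy++;		/* window opens: site "unpatchable" */
	ret = promoted_target(arg);
	optpoline_busy--;		/* window closes */
	return ret;
}

static int safe_to_patch(void)
{
	/* A real check would scan the counter on every CPU. */
	return optpoline_busy == 0;
}
```

This is exactly where the cost shows up: two extra instructions on every
promoted call, and the patcher must wait for all counters to drain, which
is the "unpatchable" window mentioned above.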


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2019-01-03 22:29     ` Nadav Amit
@ 2019-01-03 22:48       ` Andi Kleen
  2019-01-03 22:52         ` Nadav Amit
  0 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2019-01-03 22:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse

> Ok… I’ll try to think about another solution. Just note that this is just
> used as a hint to avoid unnecessary lookups. (IOW, nothing will break if the
> prefix is used.)

Are you sure actually? 

The empty prefix could mean 8-bit register accesses.

> > You're doing the equivalent of patching a private system call
> > into your own kernel without working with upstream, don't do that.
> 
> I don’t understand this comment though. Can you please explain?

Instruction encoding = system call ABI
Upstream = CPU vendors

Early in Linux's history, naive Linux distribution vendors patched in their own
private system calls without waiting for upstream to define an ABI, which caused
endless compatibility problems. These days this is very frowned upon.

> > Better to find some other solution to do the restart.
> > How about simply using a per cpu variable? That should be cheaper
> > anyways.
> 
> The problem is that the per-cpu variable needs to be updated after the call
> is executed, when we are already not in the context of the “injected” code.
> I can increase it before the call, and decrease it after return - but this
> can create (in theory) long periods in which the code is “unpatchable”,
> increase the code size and slow performance.
> 
> Anyhow, I’ll give it more thought. Ideas are welcome.

Write the address of the instruction into the per cpu variable.

-Andi


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2019-01-03 22:48       ` Andi Kleen
@ 2019-01-03 22:52         ` Nadav Amit
  2019-01-03 23:40           ` Andi Kleen
  0 siblings, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2019-01-03 22:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> On Jan 3, 2019, at 2:48 PM, Andi Kleen <ak@linux.intel.com> wrote:
> 
>> Ok… I’ll try to think about another solution. Just note that this is just
>> used as a hint to avoid unnecessary lookups. (IOW, nothing will break if the
>> prefix is used.)
> 
> Are you sure actually? 
> 
> The empty prefix could mean 8bit register accesses.
> 
>>> You're doing the equivalent of patching a private system call
>>> into your own kernel without working with upstream, don't do that.
>> 
>> I don’t understand this comment though. Can you please explain?
> 
> Instruction encoding = system call ABI
> Upstream = CPU vendors
> 
> Early in Linux's history, naive Linux distribution vendors patched in their own
> private system calls without waiting for upstream to define an ABI, which caused
> endless compatibility problems. These days this is very frowned upon.
> 
>>> Better to find some other solution to do the restart.
>>> How about simply using a per cpu variable? That should be cheaper
>>> anyways.
>> 
>> The problem is that the per-cpu variable needs to be updated after the call
>> is executed, when we are already not in the context of the “injected” code.
>> I can increase it before the call, and decrease it after return - but this
>> can create (in theory) long periods in which the code is “unpatchable”,
>> increase the code size and slow performance.
>> 
>> Anyhow, I’ll give it more thought. Ideas are welcome.
> 
> Write the address of the instruction into the per cpu variable.

Thanks for the explanations. I don’t think it would work (e.g., IRQs). I can
avoid generalizing and just detect the “magic sequence” of the code, but let
me give it some more thought.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2019-01-03 22:52         ` Nadav Amit
@ 2019-01-03 23:40           ` Andi Kleen
  2019-01-03 23:56             ` Nadav Amit
  0 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2019-01-03 23:40 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> Thanks for the explanations. I don’t think it would work (e.g., IRQs). I can
> avoid generalizing and just detect the “magic sequence” of the code, but let
> me give it some more thought.

If you ask me, I would just use compiler profile feedback or autofdo (if
your compiler has a working version).

The compiler can do a much better job at optimizing this than you ever could.

Manual FDO needs some kernel patching though.

-Andi


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2019-01-03 23:40           ` Andi Kleen
@ 2019-01-03 23:56             ` Nadav Amit
  0 siblings, 0 replies; 42+ messages in thread
From: Nadav Amit @ 2019-01-03 23:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, Josh Poimboeuf,
	Edward Cree, Thomas Gleixner, LKML, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

> On Jan 3, 2019, at 3:40 PM, Andi Kleen <ak@linux.intel.com> wrote:
> 
>> Thanks for the explanations. I don’t think it would work (e.g., IRQs). I can
>> avoid generalizing and just detect the “magic sequence” of the code, but let
>> me give it some more thought.
> 
> If you ask me I would just use compiler profile feedback or autofdo (if your
> compiler has a working version)
> 
> The compiler can do a much better job at optimizing this than you ever could.
> 
> Manual FDO needs some kernel patching though.

As long as you are willing to profile and build your own kernels. (Profiling
will not tell you if users are about to use IPv4/6).

I also don’t see how FDO can support modules.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
  2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
  2018-12-31 20:08   ` Andy Lutomirski
  2019-01-03 22:21   ` Andi Kleen
@ 2019-01-04  0:34   ` hpa
  2 siblings, 0 replies; 42+ messages in thread
From: hpa @ 2019-01-04  0:34 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar, Andy Lutomirski, Peter Zijlstra,
	Josh Poimboeuf, Edward Cree
  Cc: Thomas Gleixner, LKML, Nadav Amit, X86 ML, Paolo Abeni,
	Borislav Petkov, David Woodhouse

On December 30, 2018 11:21:07 PM PST, Nadav Amit <namit@vmware.com> wrote:
>It is sometimes beneficial to have a restartable sequence - very few
>instructions which if they are preempted jump to a predefined point.
>
>To provide such functionality on x86-64, we use an empty REX-prefix
>(opcode 0x40) as an indication for instruction in such a sequence.
>Before
>calling the schedule IRQ routine, if the "magic" prefix is found, we
>call a routine to adjust the instruction pointer.  It is expected that
>this opcode is not in common use.
>
>The following patch will make use of this function. Since there are no
>other users (yet?), the patch does not bother to create a general
>infrastructure and API that others can use for such sequences. Yet, it
>should not be hard to make such extension later.
>
>Signed-off-by: Nadav Amit <namit@vmware.com>
>---
> arch/x86/entry/entry_64.S            | 16 ++++++++++++++--
> arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
> arch/x86/kernel/traps.c              |  7 +++++++
> 3 files changed, 33 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>index 1f0efdb7b629..e144ff8b914f 100644
>--- a/arch/x86/entry/entry_64.S
>+++ b/arch/x86/entry/entry_64.S
>@@ -644,12 +644,24 @@ retint_kernel:
> 	/* Interrupts are off */
> 	/* Check if we need preemption */
> 	btl	$9, EFLAGS(%rsp)		/* were interrupts off? */
>-	jnc	1f
>+	jnc	2f
> 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
>+	jnz	2f
>+
>+	/*
>+	 * Allow to use restartable code sections in the kernel. Consider an
>+	 * instruction with the first byte having REX prefix without any bits
>+	 * set as an indication for an instruction in such a section.
>+	 */
>+	movq    RIP(%rsp), %rax
>+	cmpb    $KERNEL_RESTARTABLE_PREFIX, (%rax)
> 	jnz	1f
>+	mov	%rsp, %rdi
>+	call	restart_kernel_rseq
>+1:
> 	call	preempt_schedule_irq
> 	jmp	0b
>-1:
>+2:
> #endif
> 	/*
> 	 * The iretq could re-enable interrupts:
>diff --git a/arch/x86/include/asm/nospec-branch.h
>b/arch/x86/include/asm/nospec-branch.h
>index dad12b767ba0..be4713ef0940 100644
>--- a/arch/x86/include/asm/nospec-branch.h
>+++ b/arch/x86/include/asm/nospec-branch.h
>@@ -54,6 +54,12 @@
> 	jnz	771b;				\
> 	add	$(BITS_PER_LONG/8) * nr, sp;
> 
>+/*
>+ * An empty REX-prefix is an indication that this instruction is part
>of kernel
>+ * restartable sequence.
>+ */
>+#define KERNEL_RESTARTABLE_PREFIX		(0x40)
>+
> #ifdef __ASSEMBLY__
> 
> /*
>@@ -150,6 +156,12 @@
> #endif
> .endm
> 
>+.macro restartable_seq_prefix
>+#ifdef CONFIG_PREEMPT
>+	.byte	KERNEL_RESTARTABLE_PREFIX
>+#endif
>+.endm
>+
> #else /* __ASSEMBLY__ */
> 
> #define ANNOTATE_NOSPEC_ALTERNATIVE				\
>diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>index 85cccadb9a65..b1e855bad5ac 100644
>--- a/arch/x86/kernel/traps.c
>+++ b/arch/x86/kernel/traps.c
>@@ -59,6 +59,7 @@
> #include <asm/fpu/xstate.h>
> #include <asm/vm86.h>
> #include <asm/umip.h>
>+#include <asm/nospec-branch.h>
> 
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
>@@ -186,6 +187,12 @@ int fixup_bug(struct pt_regs *regs, int trapnr)
> 	return 0;
> }
> 
>+asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs)
>+{
>+	if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX)
>+		return;
>+}
>+
> static nokprobe_inline int
>do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
> 		  struct pt_regs *regs,	long error_code)

A 0x40 prefix is *not* a noop. It changes the interpretation of byte registers 4 through 7 from ah, ch, dh, bh to spl, bpl, sil and dil.

It may not matter in your application but:

a. You need to clarify that so is the case, and why;
b. Phrase it differently so others don't propagate the same misunderstanding.
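For what it's worth, the decoding difference can be observed from user
space with a quick x86-64 sketch (GCC/Clang inline asm; the helper names
are made up for illustration): the same reg-field value names AH without a
REX prefix but SPL once any REX prefix, even the "empty" 0x40, is present.

```c
#include <stdint.h>

/* Without a REX prefix, reg field 4 in a byte-register operand
 * decodes as AH; with any REX prefix it decodes as SPL. */

static uint8_t read_ah(uint16_t ax_val)
{
	uint8_t out;

	/* "mov %ah, %al" is encoded without a REX prefix. */
	__asm__("mov %%ah, %%al" : "=a"(out) : "a"(ax_val));
	return out;
}

static int spl_is_low_byte_of_rsp(void)
{
	uint64_t sp;
	uint8_t low;

	/* Naming %spl forces the assembler to emit a REX prefix for
	 * the very same reg-field encoding. */
	__asm__("mov %%rsp, %0\n\tmov %%spl, %1" : "=r"(sp), "=q"(low));
	return low == (uint8_t)sp;
}
```

So code that was compiled to use AH/CH/DH/BH would be silently
misinterpreted if a 0x40 byte were prepended to it, which is hpa's point
about why the changelog wording needs to be fixed.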
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-03 22:18 ` Andi Kleen
@ 2019-01-07 16:32   ` Peter Zijlstra
  2019-01-08  7:47     ` Adrian Hunter
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-07 16:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nadav Amit, Ingo Molnar, Andy Lutomirski, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	adrian.hunter, Alexander Shishkin

On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
> Nadav Amit <namit@vmware.com> writes:
> >
> > - Do we use periodic learning or not? Josh suggested to reconfigure the
> >   branches whenever a new target is found. However, I do not know at
> >   this time how to do learning efficiently, without making learning much
> >   more expensive.
> 
> FWIW frequent patching will likely completely break perf Processor Trace
> decoding, which needs a somewhat stable kernel text image to decode the
> traces generated by the CPU. Right now it relies on kcore dumped after
> the trace usually being stable because jumplabel changes happen only
> infrequently. But if you start patching frequently this assumption will
> break.
> 
> You would either need a way to turn this off, or provide
> updates for every change to the trace, so that the decoder can
> keep track.

I'm thinking it would be entirely possible to create and feed text_poke
events into the regular (!aux) buffer which can be timestamp correlated
to the PT data.
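On the decoder side, such a scheme might look roughly like the sketch
below (the record layout is hypothetical, not an existing perf ABI): each
patch emits an event carrying the patched location and the new bytes, and
the decoder replays the events against its kcore snapshot in timestamp
order so the text it decodes against matches what the CPU executed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical text-poke record, for illustration only; a real perf
 * ABI, if one is ever added, may look quite different. */
struct text_poke_event {
	uint64_t offset;	/* offset of the patch in the text image */
	uint16_t len;		/* number of bytes rewritten */
	uint8_t  bytes[16];	/* the new instruction bytes */
};

/* Decoder side: apply one logged poke to a private copy of the kernel
 * text (e.g. dumped from kcore), keeping the copy in sync with the
 * kernel that generated the PT trace. */
static void replay_poke(uint8_t *image, const struct text_poke_event *ev)
{
	memcpy(image + ev->offset, ev->bytes, ev->len);
}
```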

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-07 16:32   ` Peter Zijlstra
@ 2019-01-08  7:47     ` Adrian Hunter
  2019-01-08  9:25       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Adrian Hunter @ 2019-01-08  7:47 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Nadav Amit, Ingo Molnar, Andy Lutomirski, Josh Poimboeuf,
	Edward Cree, H . Peter Anvin, Thomas Gleixner, LKML, Nadav Amit,
	X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin

On 7/01/19 6:32 PM, Peter Zijlstra wrote:
> On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
>> Nadav Amit <namit@vmware.com> writes:
>>>
>>> - Do we use periodic learning or not? Josh suggested to reconfigure the
>>>   branches whenever a new target is found. However, I do not know at
>>>   this time how to do learning efficiently, without making learning much
>>>   more expensive.
>>
>> FWIW frequent patching will likely completely break perf Processor Trace
>> decoding, which needs a somewhat stable kernel text image to decode the
>> traces generated by the CPU. Right now it relies on kcore dumped after
>> the trace usually being stable because jumplabel changes happen only
>> infrequently. But if you start patching frequently this assumption will
>> break.
>>
>> You would either need a way to turn this off, or provide
>> updates for every change to the trace, so that the decoder can
>> keep track.
> 
> I'm thinking it would be entirely possible to create and feed text_poke
> events into the regular (!aux) buffer which can be timestamp correlated
> to the PT data.

To rebuild kernel text from such events would require a starting point.
What is the starting point?  The problem with kcore is that people can
deconfig it without realising it is needed to enable the tracing of kernel
self-modifying code.  It would be nice if it was all tied together, so that
if someone selects the ability to trace kernel self-modifying code, then all
the bits needed are also selected.  Perhaps we should expose another ELF
image that contains only kernel executable code, and take the opportunity to
put the symbols in it also.

Also what about BPF jitted code?  Will it always fit in an event?  I was
thinking of trying to add a way to prevent temporarily the unload of modules
or jitted code, which would be a good-enough solution for now.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08  7:47     ` Adrian Hunter
@ 2019-01-08  9:25       ` Peter Zijlstra
  2019-01-08 10:01         ` Adrian Hunter
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-08  9:25 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Andi Kleen, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, Nadav Amit, X86 ML, Paolo Abeni, Borislav Petkov,
	David Woodhouse, Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 09:47:18AM +0200, Adrian Hunter wrote:
> On 7/01/19 6:32 PM, Peter Zijlstra wrote:
> > On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
> >> Nadav Amit <namit@vmware.com> writes:
> >>>
> >>> - Do we use periodic learning or not? Josh suggested to reconfigure the
> >>>   branches whenever a new target is found. However, I do not know at
> >>>   this time how to do learning efficiently, without making learning much
> >>>   more expensive.
> >>
> >> FWIW frequent patching will likely completely break perf Processor Trace
> >> decoding, which needs a somewhat stable kernel text image to decode the
> >> traces generated by the CPU. Right now it relies on kcore dumped after
> >> the trace usually being stable because jumplabel changes happen only
> >> infrequently. But if you start patching frequently this assumption will
> >> break.
> >>
> >> You would either need a way to turn this off, or provide
> >> updates for every change to the trace, so that the decoder can
> >> keep track.
> > 
> > I'm thinking it would be entirely possible to create and feed text_poke
> > events into the regular (!aux) buffer which can be timestamp correlated
> > to the PT data.
> 
> To rebuild kernel text from such events would require a starting point.
> What is the starting point?  The problem with kcore is that people can
> deconfig it without realising it is needed to enable the tracing of kernel
> self-modifying code.  It would be nice if it was all tied together, so that
> if someone selects the ability to trace kernel self-modifying code, then all
> the bits needed are also selected.  Perhaps we should expose another ELF
> image that contains only kernel executable code, and take the opportunity to
> put the symbols in it also.

Meh; you always need a magic combo of CONFIG symbols to make stuff work.
We don't even have a CONFIG symbol for PT, so if you really care you
should probably start there.

If you want symbols, what stops us from exposing kallsyms in kcore as
is?

> Also what about BPF jitted code?  Will it always fit in an event?  I was
> thinking of trying to add a way to prevent temporarily the unload of modules
> or jitted code, which would be a good-enough solution for now.

We're working on BPF and kallsym events, those should, esp. when
combined with kcore, allow you to extract the actual instructions.

We don't have module events, but I suppose the kallsym events should
cover module loading (a new module results in lots of new symbols after
all).

With all that there is still a race in that nothing blocks module-unload
while we're still harvesting the information, not sure how/if we want to
cure that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08  9:25       ` Peter Zijlstra
@ 2019-01-08 10:01         ` Adrian Hunter
  2019-01-08 10:10           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Adrian Hunter @ 2019-01-08 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, Nadav Amit, X86 ML, Paolo Abeni, Borislav Petkov,
	David Woodhouse, Alexander Shishkin, songliubraving

On 8/01/19 11:25 AM, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 09:47:18AM +0200, Adrian Hunter wrote:
>> On 7/01/19 6:32 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 03, 2019 at 02:18:15PM -0800, Andi Kleen wrote:
>>>> Nadav Amit <namit@vmware.com> writes:
>>>>>
>>>>> - Do we use periodic learning or not? Josh suggested to reconfigure the
>>>>>   branches whenever a new target is found. However, I do not know at
>>>>>   this time how to do learning efficiently, without making learning much
>>>>>   more expensive.
>>>>
>>>> FWIW frequent patching will likely completely break perf Processor Trace
>>>> decoding, which needs a somewhat stable kernel text image to decode the
>>>> traces generated by the CPU. Right now it relies on kcore dumped after
>>>> the trace usually being stable because jumplabel changes happen only
>>>> infrequently. But if you start patching frequently this assumption will
>>>> break.
>>>>
>>>> You would either need a way to turn this off, or provide
>>>> updates for every change to the trace, so that the decoder can
>>>> keep track.
>>>
>>> I'm thinking it would be entirely possible to create and feed text_poke
>>> events into the regular (!aux) buffer which can be timestamp correlated
>>> to the PT data.
>>
>> To rebuild kernel text from such events would require a starting point.
>> What is the starting point?  The problem with kcore is that people can
>> deconfig it without realising it is needed to enable the tracing of kernel
>> self-modifying code.  It would be nice if it was all tied together, so that
>> if someone selects the ability to trace kernel self-modifying code, then all
>> the bits needed are also selected.  Perhaps we should expose another ELF
>> image that contains only kernel executable code, and take the opportunity to
>> put the symbols in it also.
> 
> Meh; you always need a magic combo of CONFIG symbols to make stuff work.
> We don't even have a CONFIG symbol for PT, so if you really care you
> should probably start there.
> 
> If you want symbols; what stops us from exposing kallsyms in kcore as
> is?
> 
>> Also what about BPF jitted code?  Will it always fit in an event?  I was
>> thinking of trying to add a way to prevent temporarily the unload of modules
>> or jitted code, which would be a good-enough solution for now.
> 
> We're working on BPF and kallsym events, those should, esp. when
> combined with kcore, allow you to extract the actual instructions.

The problem is that the jitted code gets freed from memory, which is why I
suggested the ability to pin it for a while.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 10:01         ` Adrian Hunter
@ 2019-01-08 10:10           ` Peter Zijlstra
  2019-01-08 17:27             ` Andi Kleen
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-08 10:10 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Andi Kleen, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, Nadav Amit, X86 ML, Paolo Abeni, Borislav Petkov,
	David Woodhouse, Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
> The problem is that the jitted code gets freed from memory, which is why I
> suggested the ability to pin it for a while.

Then what do you tell the guy that keeps PT running for a day and runs
out of memory because he likes to JIT a lot?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 10:10           ` Peter Zijlstra
@ 2019-01-08 17:27             ` Andi Kleen
  2019-01-08 18:28               ` Nadav Amit
  2019-01-08 18:57               ` [RFC v2 0/6] x86: dynamic indirect branch promotion Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2019-01-08 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, Nadav Amit, X86 ML, Paolo Abeni, Borislav Petkov,
	David Woodhouse, Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
> > The problem is that the jitted code gets freed from memory, which is why I
> > suggested the ability to pin it for a while.
> 
> Then what do you tell the guy that keeps PT running for a day and runs
> out of memory because he likes to JIT a lot?

It would only need to be saved until the next kcore dump, so they would
need to do regular kcore dumps, after each of which the JIT code could be freed.

In a sense it would be like RCU for code.

You would somehow need to tell the kernel when that happens though
so it can schedule the frees.

It doesn't work when the code is modified in place, like the
patch in the $SUBJECT.

-Andi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 17:27             ` Andi Kleen
@ 2019-01-08 18:28               ` Nadav Amit
  2019-01-08 19:01                 ` Peter Zijlstra
  2019-01-08 18:57               ` [RFC v2 0/6] x86: dynamic indirect branch promotion Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Nadav Amit @ 2019-01-08 18:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Adrian Hunter, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

> On Jan 8, 2019, at 9:27 AM, Andi Kleen <ak@linux.intel.com> wrote:
> 
> On Tue, Jan 08, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 08, 2019 at 12:01:11PM +0200, Adrian Hunter wrote:
>>> The problem is that the jitted code gets freed from memory, which is why I
>>> suggested the ability to pin it for a while.
>> 
>> Then what do you tell the guy that keeps PT running for a day and runs
>> out of memory because he likes to JIT a lot?
> 
> It would only need to be saved until the next kcore dump, so they would
> need to do regular kcore dumps, after each of which the JIT code could be freed.
> 
> In a sense it would be like RCU for code.
> 
> You would somehow need to tell the kernel when that happens though
> so it can schedule the frees.
> 
> It doesn't work when the code is modified in place, like the
> patch in the $SUBJECT.

Excuse my ignorance - can you be more concrete what will break where?

I am looking at perf-with-kcore, and intuitively the main thing that is
required is to take text_mutex while kcore is copied, to get a point-in-time
snapshot.

Is it really that important for debugging to get the instructions at the
time of execution? Wouldn’t it be easier to annotate the instructions that
might change? After all, it is not as if any instruction can change to any
other instruction.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 17:27             ` Andi Kleen
  2019-01-08 18:28               ` Nadav Amit
@ 2019-01-08 18:57               ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-08 18:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Adrian Hunter, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, Nadav Amit, X86 ML, Paolo Abeni, Borislav Petkov,
	David Woodhouse, Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 09:27:21AM -0800, Andi Kleen wrote:
> It doesn't work when the code is modified in place, like the
> patch in the $SUBJECT.

For in-place modification a-la jump_label / static_call etc. we can
'trivially' log the new instruction and the decoder can do matching
in-place kcore adjustments.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 18:28               ` Nadav Amit
@ 2019-01-08 19:01                 ` Peter Zijlstra
  2019-01-08 20:47                   ` Nadav Amit
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-08 19:01 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andi Kleen, Adrian Hunter, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 10:28:02AM -0800, Nadav Amit wrote:
> Is it really that important for debugging to get the instructions at the
> time of execution? Wouldn’t it be easier to annotate the instructions that
> might change? After all, it is not as if any instruction can change to any
> other instruction.

I think PT has a bitstream encoding of branch-taken; to decode and
follow the actual code-flow you then need to have the actual and
accurate branch target from the code. If we go muck about with the code
and change that, decoding gets somewhat 'tricky'.

Or something along those lines..

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 19:01                 ` Peter Zijlstra
@ 2019-01-08 20:47                   ` Nadav Amit
  2019-01-08 20:53                     ` Andi Kleen
  2019-01-09 10:35                     ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Nadav Amit @ 2019-01-08 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Adrian Hunter, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

> On Jan 8, 2019, at 11:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Jan 08, 2019 at 10:28:02AM -0800, Nadav Amit wrote:
>> Is it really that important for debugging to get the instructions at the
>> time of execution? Wouldn’t it be easier to annotate the instructions that
>> might change? After all, it is not as if any instruction can change to any
>> other instruction.
> 
> I think PT has a bitstream encoding of branch-taken; to decode and
> follow the actual code-flow you then need to have the actual and
> accurate branch target from the code. If we go muck about with the code
> and change that, decoding gets somewhat 'tricky'.
> 
> Or something along those lines..

Thanks for the explanation (I now see it in the SDM and sources).

Basically, the updates should not be done too frequently, and I can expose
an interface to suspend them (this will not affect correctness). I think
this can be the easiest solution, which should not affect the workload
execution too much.

A general solution is more complicated, however, due to the racy nature of
cross-modifying code. There would need to be TSC recording of the time
before the modifications start and after they are done.

BTW: I am not sure that static-keys are much better. Their change also
affects the control flow.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 20:47                   ` Nadav Amit
@ 2019-01-08 20:53                     ` Andi Kleen
  2019-01-09 10:35                     ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2019-01-08 20:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Adrian Hunter, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

> BTW: I am not sure that static-keys are much better. Their change also
> affects the control flow, and they do affect the control flow.

Static keys have the same problem, but they only change infrequently so usually
it's not too big a problem if you dump the kernel close to the tracing 
sessions.

simple-pt doesn't have a similar mechanism, so it suffers more from it.

-Andi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 0/6] x86: dynamic indirect branch promotion
  2019-01-08 20:47                   ` Nadav Amit
  2019-01-08 20:53                     ` Andi Kleen
@ 2019-01-09 10:35                     ` Peter Zijlstra
  2019-08-29  8:23                       ` Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion) Adrian Hunter
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-01-09 10:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andi Kleen, Adrian Hunter, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:

> A general solution is more complicated, however, due to the racy nature of
> cross-modifying code. There would need to be TSC recording of the time
> before the modifications start and after they are done.
> 
> BTW: I am not sure that static-keys are much better. Their change also
> affects the control flow, and they do affect the control flow.

Any text_poke() user is a problem; which is why I suggested a
PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
timestamped and can be correlated to the trace.

As to the racy nature of text_poke, yes, this is a wee bit tricky and
might need some care. I _think_ we can make it work, but I'm not 100%
sure on exactly how PT works, but something like:

 - write INT3 byte
 - IPI-SYNC

and ensure the poke_handler preserves the existing control flow (which
it currently does not, but should be possible).

 - emit RECORD_TEXT_POKE with the new instruction

at this point the actual control flow will be through the INT3 and
handler and not hit the actual instruction, so the actual state is
irrelevant.

 - write instruction tail
 - IPI-SYNC
 - write first byte
 - IPI-SYNC

And at this point we start using the new instruction, but this is after
the timestamp from the RECORD_TEXT_POKE event and decoding should work
just fine.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-01-09 10:35                     ` Peter Zijlstra
@ 2019-08-29  8:23                       ` Adrian Hunter
  2019-08-29  8:53                         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Adrian Hunter @ 2019-08-29  8:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> 
>> A general solution is more complicated, however, due to the racy nature of
>> cross-modifying code. There would need to be TSC recording of the time
>> before the modifications start and after they are done.
>>
>> BTW: I am not sure that static-keys are much better. Their change also
>> affects the control flow, and they do affect the control flow.
> 
> Any text_poke() user is a problem; which is why I suggested a
> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> timestamped and can be correlated to the trace.
> 
> As to the racy nature of text_poke, yes, this is a wee bit tricky and
> might need some care. I _think_ we can make it work, but I'm not 100%
> sure on exactly how PT works, but something like:
> 
>  - write INT3 byte
>  - IPI-SYNC
> 
> and ensure the poke_handler preserves the existing control flow (which
> it currently does not, but should be possible).
> 
>  - emit RECORD_TEXT_POKE with the new instruction
> 
> at this point the actual control flow will be through the INT3 and
> handler and not hit the actual instruction, so the actual state is
> irrelevant.
> 
>  - write instruction tail
>  - IPI-SYNC
>  - write first byte
>  - IPI-SYNC
> 
> And at this point we start using the new instruction, but this is after
> the timestamp from the RECORD_TEXT_POKE event and decoding should work
> just fine.
> 

Presumably the IPI-SYNC does not guarantee that other CPUs will not already
have seen the change.  In that case, it is not possible to provide a
timestamp before which all CPUs executed the old code, and after which all
CPUs execute the new code.

So we need 2 events: one before and one after the text poke.  Then it will
be up to the tools to figure out which code path was taken in that time
interval. e.g.

	1. emit RECORD_TEXT_POKE
		flags:	BEFORE (timestamp is before change)
		method:	INT3 (INT3 method used to change code)
		ip
		code size
		code bytes before
		code bytes after

	2. text poke

	3. emit RECORD_TEXT_POKE
		flags:	AFTER (timestamp is after change)
		method:	INT3 (INT3 method used to change code)
		ip
		code size
		code bytes before
		code bytes after

Thoughts?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-08-29  8:23                       ` Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion) Adrian Hunter
@ 2019-08-29  8:53                         ` Peter Zijlstra
  2019-08-29  9:40                           ` Adrian Hunter
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-08-29  8:53 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
> > On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
> > 
> >> A general solution is more complicated, however, due to the racy nature of
> >> cross-modifying code. There would need to be TSC recording of the time
> >> before the modifications start and after they are done.
> >>
> >> BTW: I am not sure that static-keys are much better. Their change also
> >> affects the control flow, and they do affect the control flow.
> > 
> > Any text_poke() user is a problem; which is why I suggested a
> > PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
> > timestamped and can be correlated to the trace.
> > 
> > As to the racy nature of text_poke, yes, this is a wee bit tricky and
> > might need some care. I _think_ we can make it work, but I'm not 100%
> > sure on exactly how PT works, but something like:
> > 
> >  - write INT3 byte
> >  - IPI-SYNC
> > 
> > and ensure the poke_handler preserves the existing control flow (which
> > it currently does not, but should be possible).
> > 
> >  - emit RECORD_TEXT_POKE with the new instruction
> > 
> > at this point the actual control flow will be through the INT3 and
> > handler and not hit the actual instruction, so the actual state is
> > irrelevant.
> > 
> >  - write instruction tail
> >  - IPI-SYNC
> >  - write first byte
> >  - IPI-SYNC
> > 
> > And at this point we start using the new instruction, but this is after
> > the timestamp from the RECORD_TEXT_POKE event and decoding should work
> > just fine.
> > 
> 
> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
> have seen the change.  In that case, it is not possible to provide a
> timestamp before which all CPUs executed the old code, and after which all
> CPUs execute the new code.

'the change' is an INT3 poke, so either you see the old code flow, or
you see an INT3 emulate the old flow in your trace.

That should be unambiguous.

Then you emit the RECORD_TEXT_POKE with the new instruction in it. This
prepares the decoder to accept a new reality.

Then we finish the instruction poke.

And then when the trace no longer shows INT3 exceptions, you know the
new code is in effect.

How is this ambiguous?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-08-29  8:53                         ` Peter Zijlstra
@ 2019-08-29  9:40                           ` Adrian Hunter
  2019-08-29 11:46                             ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Adrian Hunter @ 2019-08-29  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On 29/08/19 11:53 AM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 11:23:52AM +0300, Adrian Hunter wrote:
>> On 9/01/19 12:35 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 08, 2019 at 12:47:42PM -0800, Nadav Amit wrote:
>>>
>>>> A general solution is more complicated, however, due to the racy nature of
>>>> cross-modifying code. There would need to be TSC recording of the time
>>>> before the modifications start and after they are done.
>>>>
>>>> BTW: I am not sure that static-keys are much better. Their change also
>>>> affects the control flow, and they do affect the control flow.
>>>
>>> Any text_poke() user is a problem; which is why I suggested a
>>> PERF_RECORD_TEXT_POKE that emits the new instruction. Such records are
>>> timestamped and can be correlated to the trace.
>>>
>>> As to the racy nature of text_poke, yes, this is a wee bit tricky and
>>> might need some care. I _think_ we can make it work, but I'm not 100%
>>> sure on exactly how PT works, but something like:
>>>
>>>  - write INT3 byte
>>>  - IPI-SYNC
>>>
>>> and ensure the poke_handler preserves the existing control flow (which
>>> it currently does not, but should be possible).
>>>
>>>  - emit RECORD_TEXT_POKE with the new instruction
>>>
>>> at this point the actual control flow will be through the INT3 and
>>> handler and not hit the actual instruction, so the actual state is
>>> irrelevant.
>>>
>>>  - write instruction tail
>>>  - IPI-SYNC
>>>  - write first byte
>>>  - IPI-SYNC
>>>
>>> And at this point we start using the new instruction, but this is after
>>> the timestamp from the RECORD_TEXT_POKE event and decoding should work
>>> just fine.
>>>
>>
>> Presumably the IPI-SYNC does not guarantee that other CPUs will not already
>> have seen the change.  In that case, it is not possible to provide a
>> timestamp before which all CPUs executed the old code, and after which all
>> CPUs execute the new code.
> 
> 'the change' is an INT3 poke, so either you see the old code flow, or
> you see an INT3 emulate the old flow in your trace.
> 
> That should be unambiguous.
> 
> Then you emit the RECORD_TEXT_POKE with the new instruction on. This
> prepares the decoder to accept a new reality.
> 
> Then we finish the instruction poke.
> 
> And then when the trace no longer shows INT3 exceptions, you know the
> new code is in effect.
> 
> How is this ambiguous?

It's not.  I didn't get that on first read, sorry.

Can you expand on "and ensure the poke_handler preserves the existing
control flow"?  Whatever the INT3-handler does will be traced normally so
long as it does not itself execute self-modified code.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-08-29  9:40                           ` Adrian Hunter
@ 2019-08-29 11:46                             ` Peter Zijlstra
  2019-09-12  7:00                               ` Adrian Hunter
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-08-29 11:46 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
> Can you expand on "and ensure the poke_handler preserves the existing
> control flow"?  Whatever the INT3-handler does will be traced normally so
> long as it does not itself execute self-modified code.

My thinking was that the code shouldn't change semantics before emitting
the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
matter much.

Either we run the old code or INT3 does 'something'.  Then we get
RECORD_TEXT_POKE and finish the poke.  Which tells us that the moment
the INT3s stop, the new text is in place.

I suppose that works too, and requires less changes.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-08-29 11:46                             ` Peter Zijlstra
@ 2019-09-12  7:00                               ` Adrian Hunter
  2019-09-12 12:17                                 ` hpa
  0 siblings, 1 reply; 42+ messages in thread
From: Adrian Hunter @ 2019-09-12  7:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On 29/08/19 2:46 PM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
>> Can you expand on "and ensure the poke_handler preserves the existing
>> control flow"?  Whatever the INT3-handler does will be traced normally so
>> long as it does not itself execute self-modified code.
> 
> My thinking was that the code shouldn't change semantics before emitting
> the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
> matter much.
> 
> Either we run the old code or INT3 does 'something'.  Then we get
> RECORD_TEXT_POKE and finish the poke.  Which tells that the moment INT3
> stops the new text is in place.
> 
> I suppose that works too, and requires less changes.


What about this?


diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 70c09967a999..00aa9bef2b9d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -30,6 +30,7 @@ struct text_poke_loc {
 	void *addr;
 	size_t len;
 	const char opcode[POKE_MAX_OPCODE_SIZE];
+	char old_opcode[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ccd32013c47a..c781bbbbbafd 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3,6 +3,7 @@
 
 #include <linux/module.h>
 #include <linux/sched.h>
+#include <linux/perf_event.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/stringify.h>
@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 	/*
 	 * First step: add a int3 trap to the address that will be patched.
 	 */
-	for (i = 0; i < nr_entries; i++)
+	for (i = 0; i < nr_entries; i++) {
+		memcpy(tp[i].old_opcode, tp[i].addr, tp[i].len);
 		text_poke(tp[i].addr, &int3, sizeof(int3));
+	}
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
@@ -1071,6 +1074,11 @@ void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
+	for (i = 0; i < nr_entries; i++) {
+		perf_event_text_poke((unsigned long)tp[i].addr,
+				     tp[i].old_opcode, tp[i].opcode, tp[i].len);
+	}
+
 	/*
 	 * Third step: replace the first byte (int3) by the first byte of
 	 * replacing opcode.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..f4c6095d2110 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_namespaces(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
+extern void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+				 const void *new_bytes, size_t len);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline void perf_event_text_poke(unsigned long addr,
+					const void *old_bytes,
+					const void *new_bytes,
+					size_t len)			{ }
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..6396d4c0d2f9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -375,7 +375,8 @@ struct perf_event_attr {
 				ksymbol        :  1, /* include ksymbol events */
 				bpf_event      :  1, /* include bpf events */
 				aux_output     :  1, /* generate AUX records instead of events */
-				__reserved_1   : 32;
+				text_poke      :  1, /* include text poke events */
+				__reserved_1   : 31;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1000,6 +1001,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_BPF_EVENT			= 18,
 
+	/*
+	 * Records changes to kernel text i.e. self-modified code.
+	 * 'len' is the number of old bytes, which is the same as the number
+	 * of new bytes. 'bytes' contains the old bytes followed immediately
+	 * by the new bytes.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				addr;
+	 *	u16				len;
+	 *	u8				bytes[];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_TEXT_POKE			= 19,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 811bb333c986..43c0d3d232dc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
 static atomic_t nr_switch_events __read_mostly;
 static atomic_t nr_ksymbol_events __read_mostly;
 static atomic_t nr_bpf_events __read_mostly;
+static atomic_t nr_text_poke_events __read_mostly;
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -4339,7 +4340,7 @@ static bool is_sb_event(struct perf_event *event)
 	if (attr->mmap || attr->mmap_data || attr->mmap2 ||
 	    attr->comm || attr->comm_exec ||
 	    attr->task || attr->ksymbol ||
-	    attr->context_switch ||
+	    attr->context_switch || attr->text_poke ||
 	    attr->bpf_event)
 		return true;
 	return false;
@@ -4413,6 +4414,8 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_dec(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_dec(&nr_text_poke_events);
 
 	if (dec) {
 		if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -8045,6 +8048,85 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_text_poke_event {
+	const void		*old_bytes;
+	const void		*new_bytes;
+	size_t			pad;
+	u16			len;
+
+	struct {
+		struct perf_event_header	header;
+
+		u64				addr;
+	} event_id;
+};
+
+static int perf_event_text_poke_match(struct perf_event *event)
+{
+	return event->attr.text_poke;
+}
+
+static void perf_event_text_poke_output(struct perf_event *event, void *data)
+{
+	struct perf_text_poke_event *text_poke_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	u64 padding = 0;
+	int ret;
+
+	if (!perf_event_text_poke_match(event))
+		return;
+
+	perf_event_header__init_id(&text_poke_event->event_id.header, &sample, event);
+
+	ret = perf_output_begin(&handle, event, text_poke_event->event_id.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, text_poke_event->event_id);
+	perf_output_put(&handle, text_poke_event->len);
+
+	__output_copy(&handle, text_poke_event->old_bytes, text_poke_event->len);
+	__output_copy(&handle, text_poke_event->new_bytes, text_poke_event->len);
+
+	if (text_poke_event->pad)
+		__output_copy(&handle, &padding, text_poke_event->pad);
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
+void perf_event_text_poke(unsigned long addr, const void *old_bytes,
+			  const void *new_bytes, size_t len)
+{
+	struct perf_text_poke_event text_poke_event;
+	size_t tot, pad;
+
+	if (!atomic_read(&nr_text_poke_events))
+		return;
+
+	tot  = sizeof(text_poke_event.len) + (len << 1);
+	pad  = ALIGN(tot, sizeof(u64)) - tot;
+
+	text_poke_event = (struct perf_text_poke_event){
+		.old_bytes    = old_bytes,
+		.new_bytes    = new_bytes,
+		.pad          = pad,
+		.len          = len,
+		.event_id  = {
+			.header = {
+				.type = PERF_RECORD_TEXT_POKE,
+				.misc = PERF_RECORD_MISC_KERNEL,
+				.size = sizeof(text_poke_event.event_id) + tot + pad,
+			},
+			.addr = addr,
+		},
+	};
+
+	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
+}
+
 void perf_event_itrace_started(struct perf_event *event)
 {
 	event->attach_state |= PERF_ATTACH_ITRACE;
@@ -10331,6 +10413,8 @@ static void account_event(struct perf_event *event)
 		atomic_inc(&nr_ksymbol_events);
 	if (event->attr.bpf_event)
 		atomic_inc(&nr_bpf_events);
+	if (event->attr.text_poke)
+		atomic_inc(&nr_text_poke_events);
 
 	if (inc) {
 		/*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
  2019-09-12  7:00                               ` Adrian Hunter
@ 2019-09-12 12:17                                 ` hpa
  0 siblings, 0 replies; 42+ messages in thread
From: hpa @ 2019-09-12 12:17 UTC (permalink / raw)
  To: Adrian Hunter, Peter Zijlstra
  Cc: Nadav Amit, Andi Kleen, Ingo Molnar, Andy Lutomirski,
	Josh Poimboeuf, Edward Cree, Thomas Gleixner, LKML, X86 ML,
	Paolo Abeni, Borislav Petkov, David Woodhouse,
	Alexander Shishkin, songliubraving

On September 12, 2019 8:00:39 AM GMT+01:00, Adrian Hunter <adrian.hunter@intel.com> wrote:
>On 29/08/19 2:46 PM, Peter Zijlstra wrote:
>> On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote:
>>> Can you expand on "and ensure the poke_handler preserves the
>existing
>>> control flow"?  Whatever the INT3-handler does will be traced
>normally so
>>> long as it does not itself execute self-modified code.
>> 
>> My thinking was that the code shouldn't change semantics before
>emitting
>> the RECORD_TEXT_POKE; but you're right in that that doesn't seem to
>> matter much.
>> 
>> Either we run the old code or INT3 does 'something'.  Then we get
>> RECORD_TEXT_POKE and finish the poke.  Which tells that the moment
>INT3
>> stops the new text is in place.
>> 
>> I suppose that works too, and requires less changes.
>
>
>What about this?
>
>
>diff --git a/arch/x86/include/asm/text-patching.h
>b/arch/x86/include/asm/text-patching.h
>index 70c09967a999..00aa9bef2b9d 100644
>--- a/arch/x86/include/asm/text-patching.h
>+++ b/arch/x86/include/asm/text-patching.h
>@@ -30,6 +30,7 @@ struct text_poke_loc {
> 	void *addr;
> 	size_t len;
> 	const char opcode[POKE_MAX_OPCODE_SIZE];
>+	char old_opcode[POKE_MAX_OPCODE_SIZE];
> };
> 
>extern void text_poke_early(void *addr, const void *opcode, size_t
>len);
>diff --git a/arch/x86/kernel/alternative.c
>b/arch/x86/kernel/alternative.c
>index ccd32013c47a..c781bbbbbafd 100644
>--- a/arch/x86/kernel/alternative.c
>+++ b/arch/x86/kernel/alternative.c
>@@ -3,6 +3,7 @@
> 
> #include <linux/module.h>
> #include <linux/sched.h>
>+#include <linux/perf_event.h>
> #include <linux/mutex.h>
> #include <linux/list.h>
> #include <linux/stringify.h>
>@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc
>*tp, unsigned int nr_entries)
> 	/*
> 	 * First step: add a int3 trap to the address that will be patched.
> 	 */
>-	for (i = 0; i < nr_entries; i++)
>+	for (i = 0; i < nr_entries; i++) {
>+		memcpy(tp[i].old_opcode, tp[i].addr, tp[i].len);
> 		text_poke(tp[i].addr, &int3, sizeof(int3));
>+	}
> 
> 	on_each_cpu(do_sync_core, NULL, 1);
> 
>@@ -1071,6 +1074,11 @@ void text_poke_bp_batch(struct text_poke_loc
>*tp, unsigned int nr_entries)
> 		on_each_cpu(do_sync_core, NULL, 1);
> 	}
> 
>+	for (i = 0; i < nr_entries; i++) {
>+		perf_event_text_poke((unsigned long)tp[i].addr,
>+				     tp[i].old_opcode, tp[i].opcode, tp[i].len);
>+	}
>+
> 	/*
> 	 * Third step: replace the first byte (int3) by the first byte of
> 	 * replacing opcode.
>diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>index 61448c19a132..f4c6095d2110 100644
>--- a/include/linux/perf_event.h
>+++ b/include/linux/perf_event.h
>@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void);
> extern void perf_event_comm(struct task_struct *tsk, bool exec);
> extern void perf_event_namespaces(struct task_struct *tsk);
> extern void perf_event_fork(struct task_struct *tsk);
>+extern void perf_event_text_poke(unsigned long addr, const void
>*old_bytes,
>+				 const void *new_bytes, size_t len);
> 
> /* Callchains */
> DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void)				{ }
>static inline void perf_event_comm(struct task_struct *tsk, bool
>exec)	{ }
> static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
> static inline void perf_event_fork(struct task_struct *tsk)		{ }
>+static inline void perf_event_text_poke(unsigned long addr,
>+					const void *old_bytes,
>+					const void *new_bytes,
>+					size_t len)			{ }
> static inline void perf_event_init(void)				{ }
> static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
> static inline void perf_swevent_put_recursion_context(int rctx)		{ }
>diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>index bb7b271397a6..6396d4c0d2f9 100644
>--- a/include/uapi/linux/perf_event.h
>+++ b/include/uapi/linux/perf_event.h
>@@ -375,7 +375,8 @@ struct perf_event_attr {
> 				ksymbol        :  1, /* include ksymbol events */
> 				bpf_event      :  1, /* include bpf events */
> 				aux_output     :  1, /* generate AUX records instead of events */
>-				__reserved_1   : 32;
>+				text_poke      :  1, /* include text poke events */
>+				__reserved_1   : 31;
> 
> 	union {
> 		__u32		wakeup_events;	  /* wakeup every n events */
>@@ -1000,6 +1001,22 @@ enum perf_event_type {
> 	 */
> 	PERF_RECORD_BPF_EVENT			= 18,
> 
>+	/*
>+	 * Records changes to kernel text i.e. self-modified code.
>+	 * 'len' is the number of old bytes, which is the same as the number
>+	 * of new bytes. 'bytes' contains the old bytes followed immediately
>+	 * by the new bytes.
>+	 *
>+	 * struct {
>+	 *	struct perf_event_header	header;
>+	 *	u64				addr;
>+	 *	u16				len;
>+	 *	u8				bytes[];
>+	 *	struct sample_id		sample_id;
>+	 * };
>+	 */
>+	PERF_RECORD_TEXT_POKE			= 19,
>+
> 	PERF_RECORD_MAX,			/* non-ABI */
> };
> 
>diff --git a/kernel/events/core.c b/kernel/events/core.c
>index 811bb333c986..43c0d3d232dc 100644
>--- a/kernel/events/core.c
>+++ b/kernel/events/core.c
>@@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
> static atomic_t nr_switch_events __read_mostly;
> static atomic_t nr_ksymbol_events __read_mostly;
> static atomic_t nr_bpf_events __read_mostly;
>+static atomic_t nr_text_poke_events __read_mostly;
> 
> static LIST_HEAD(pmus);
> static DEFINE_MUTEX(pmus_lock);
>@@ -4339,7 +4340,7 @@ static bool is_sb_event(struct perf_event *event)
> 	if (attr->mmap || attr->mmap_data || attr->mmap2 ||
> 	    attr->comm || attr->comm_exec ||
> 	    attr->task || attr->ksymbol ||
>-	    attr->context_switch ||
>+	    attr->context_switch || attr->text_poke ||
> 	    attr->bpf_event)
> 		return true;
> 	return false;
>@@ -4413,6 +4414,8 @@ static void unaccount_event(struct perf_event *event)
> 		atomic_dec(&nr_ksymbol_events);
> 	if (event->attr.bpf_event)
> 		atomic_dec(&nr_bpf_events);
>+	if (event->attr.text_poke)
>+		atomic_dec(&nr_text_poke_events);
> 
> 	if (dec) {
> 		if (!atomic_add_unless(&perf_sched_count, -1, 1))
>@@ -8045,6 +8048,85 @@ void perf_event_bpf_event(struct bpf_prog *prog,
> 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
> }
> 
>+struct perf_text_poke_event {
>+	const void		*old_bytes;
>+	const void		*new_bytes;
>+	size_t			pad;
>+	u16			len;
>+
>+	struct {
>+		struct perf_event_header	header;
>+
>+		u64				addr;
>+	} event_id;
>+};
>+
>+static int perf_event_text_poke_match(struct perf_event *event)
>+{
>+	return event->attr.text_poke;
>+}
>+
>+static void perf_event_text_poke_output(struct perf_event *event, void *data)
>+{
>+	struct perf_text_poke_event *text_poke_event = data;
>+	struct perf_output_handle handle;
>+	struct perf_sample_data sample;
>+	u64 padding = 0;
>+	int ret;
>+
>+	if (!perf_event_text_poke_match(event))
>+		return;
>+
>+	perf_event_header__init_id(&text_poke_event->event_id.header, &sample, event);
>+
>+	ret = perf_output_begin(&handle, event, text_poke_event->event_id.header.size);
>+	if (ret)
>+		return;
>+
>+	perf_output_put(&handle, text_poke_event->event_id);
>+	perf_output_put(&handle, text_poke_event->len);
>+
>+	__output_copy(&handle, text_poke_event->old_bytes, text_poke_event->len);
>+	__output_copy(&handle, text_poke_event->new_bytes, text_poke_event->len);
>+
>+	if (text_poke_event->pad)
>+		__output_copy(&handle, &padding, text_poke_event->pad);
>+
>+	perf_event__output_id_sample(event, &handle, &sample);
>+
>+	perf_output_end(&handle);
>+}
>+
>+void perf_event_text_poke(unsigned long addr, const void *old_bytes,
>+			  const void *new_bytes, size_t len)
>+{
>+	struct perf_text_poke_event text_poke_event;
>+	size_t tot, pad;
>+
>+	if (!atomic_read(&nr_text_poke_events))
>+		return;
>+
>+	tot  = sizeof(text_poke_event.len) + (len << 1);
>+	pad  = ALIGN(tot, sizeof(u64)) - tot;
>+
>+	text_poke_event = (struct perf_text_poke_event){
>+		.old_bytes    = old_bytes,
>+		.new_bytes    = new_bytes,
>+		.pad          = pad,
>+		.len          = len,
>+		.event_id  = {
>+			.header = {
>+				.type = PERF_RECORD_TEXT_POKE,
>+				.misc = PERF_RECORD_MISC_KERNEL,
>+				.size = sizeof(text_poke_event.event_id) + tot + pad,
>+			},
>+			.addr = addr,
>+		},
>+	};
>+
>+	perf_iterate_sb(perf_event_text_poke_output, &text_poke_event, NULL);
>+}
>+
> void perf_event_itrace_started(struct perf_event *event)
> {
> 	event->attach_state |= PERF_ATTACH_ITRACE;
>@@ -10331,6 +10413,8 @@ static void account_event(struct perf_event *event)
> 		atomic_inc(&nr_ksymbol_events);
> 	if (event->attr.bpf_event)
> 		atomic_inc(&nr_bpf_events);
>+	if (event->attr.text_poke)
>+		atomic_inc(&nr_text_poke_events);
> 
> 	if (inc) {
> 		/*

Wasn't there a long discussion about this a while ago? Holding (or spinning) on INT 3 has a lot of nice properties, e.g. no need to emulate instructions. However, there were concerns about deadlocks. I proposed an algorithm which I *believe* addressed the deadlocks, but obviously when it comes to deadlock avoidance one pair of eyeballs is not enough.

My flight is about to take off so I can't look up the email thread right now, unfortunately.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Thread overview: 42+ messages
2018-12-31  7:21 [RFC v2 0/6] x86: dynamic indirect branch promotion Nadav Amit
2018-12-31  7:21 ` [RFC v2 1/6] x86: introduce kernel restartable sequence Nadav Amit
2018-12-31 20:08   ` Andy Lutomirski
2018-12-31 21:12     ` Nadav Amit
2019-01-03 22:21   ` Andi Kleen
2019-01-03 22:29     ` Nadav Amit
2019-01-03 22:48       ` Andi Kleen
2019-01-03 22:52         ` Nadav Amit
2019-01-03 23:40           ` Andi Kleen
2019-01-03 23:56             ` Nadav Amit
2019-01-04  0:34   ` hpa
2018-12-31  7:21 ` [RFC v2 2/6] objtool: ignore instructions Nadav Amit
2018-12-31  7:21 ` [RFC v2 3/6] x86: patch indirect branch promotion Nadav Amit
2018-12-31  7:21 ` [RFC v2 4/6] x86: interface for accessing indirect branch locations Nadav Amit
2018-12-31  7:21 ` [RFC v2 5/6] x86: learning and patching indirect branch targets Nadav Amit
2018-12-31 20:05   ` Andy Lutomirski
2018-12-31 21:07     ` Nadav Amit
2018-12-31  7:21 ` [RFC v2 6/6] x86: outline optpoline Nadav Amit
2018-12-31 19:51 ` [RFC v2 0/6] x86: dynamic indirect branch promotion Andy Lutomirski
2018-12-31 19:53   ` Nadav Amit
2019-01-03 18:10     ` Josh Poimboeuf
2019-01-03 18:30       ` Nadav Amit
2019-01-03 20:31         ` Josh Poimboeuf
2019-01-03 22:18 ` Andi Kleen
2019-01-07 16:32   ` Peter Zijlstra
2019-01-08  7:47     ` Adrian Hunter
2019-01-08  9:25       ` Peter Zijlstra
2019-01-08 10:01         ` Adrian Hunter
2019-01-08 10:10           ` Peter Zijlstra
2019-01-08 17:27             ` Andi Kleen
2019-01-08 18:28               ` Nadav Amit
2019-01-08 19:01                 ` Peter Zijlstra
2019-01-08 20:47                   ` Nadav Amit
2019-01-08 20:53                     ` Andi Kleen
2019-01-09 10:35                     ` Peter Zijlstra
2019-08-29  8:23                       ` Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion) Adrian Hunter
2019-08-29  8:53                         ` Peter Zijlstra
2019-08-29  9:40                           ` Adrian Hunter
2019-08-29 11:46                             ` Peter Zijlstra
2019-09-12  7:00                               ` Adrian Hunter
2019-09-12 12:17                                 ` hpa
2019-01-08 18:57               ` [RFC v2 0/6] x86: dynamic indirect branch promotion Peter Zijlstra
