* [PATCH 00/15] x86 cleanups and static_call()
@ 2019-06-05 13:07 Peter Zijlstra
  2019-06-05 13:07 ` [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Hi!

Now that all the x86_64 int3_emulate_call() stuff is upstream, here are the
i386 patches and all the rest of the cleanups that resulted from that
discussion.

And I figured I should have a go at making that static_call() thing work now
that we have the prerequisites sorted.  This is mostly a combination of the v2
and v3 static_call() patches done on top of an 'enhanced' text_poke_bp().

I wrote a little self-test and added Steve's ftrace conversion on top for
testing, and suspiciously I've not had it explode, so I'm sure this is going to
set all your computers on fire.

Enjoy!


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-07 14:21   ` Josh Poimboeuf
  2019-06-05 13:07 ` [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

The code flow around the return from interrupt preemption point seems
needlessly complicated.

There is only one site jumping to resume_kernel, and none (outside of
resume_kernel) jumping to restore_all_kernel. Inline resume_kernel
into restore_all_kernel and avoid the CONFIG_PREEMPT-dependent label.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_32.S |   24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)
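 
For reference, the CONFIG_PREEMPT check that restore_all_kernel now carries
(see the hunk below) is roughly the following C logic; this is an
illustrative sketch only, restore_all_kernel_preempt() is a made-up name and
the real code stays in asm:

static void restore_all_kernel_preempt(struct pt_regs *regs)
{
	local_irq_disable();

	/* Don't preempt if preemption is disabled... */
	if (preempt_count())
		return;

	/* ...or if the interrupted context had interrupts off. */
	if (!(regs->flags & X86_EFLAGS_IF))
		return;

	preempt_schedule_irq();
}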

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,7 +67,6 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
 .macro TRACE_IRQS_IRET
@@ -755,7 +754,7 @@ END(ret_from_fork)
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -765,18 +764,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	restore_all_kernel
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1027,6 +1014,15 @@ ENTRY(entry_INT80_32)
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lno_preempt
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lno_preempt
+	call	preempt_schedule_irq
+.Lno_preempt:
+#endif
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
  2019-06-05 13:07 ` [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-07 14:24   ` Josh Poimboeuf
  2019-06-05 13:07 ` [PATCH 03/15] x86/kprobes: Fix frame pointer annotations Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

In preparation for wider use, move the ENCODE_FRAME_POINTER macros to
a common header and provide inline asm versions.

These macros are used to encode a pt_regs frame for the unwinder; see
unwind_frame.c:decode_frame_pointer().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/calling.h     |   15 -------------
 arch/x86/entry/entry_32.S    |   16 --------------
 arch/x86/include/asm/frame.h |   49 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 31 deletions(-)
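 
As a quick reminder of what the encoding buys us, the unwinder undoes the
tagging to recover the pt_regs pointer; a simplified sketch along the lines
of the unwind_frame.c:decode_frame_pointer() mentioned above:

#ifdef CONFIG_X86_64
/* LSB set: an odd "frame pointer" is really a pt_regs pointer. */
static struct pt_regs *decode_frame_pointer(unsigned long *bp)
{
	unsigned long regs = (unsigned long)bp;

	if (!(regs & 0x1))
		return NULL;

	return (struct pt_regs *)(regs & ~0x1);
}
#else
/* MSB cleared: not a valid kernel address, so it's pt_regs in disguise. */
static struct pt_regs *decode_frame_pointer(unsigned long *bp)
{
	unsigned long regs = (unsigned long)bp;

	if (regs & 0x80000000)
		return NULL;

	return (struct pt_regs *)(regs | 0x80000000);
}
#endif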

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -172,21 +172,6 @@ For 32-bit we have the following convent
 	.endif
 .endm
 
-/*
- * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
- * frame pointer is replaced with an encoded pointer to pt_regs.  The encoding
- * is just setting the LSB, which makes it an invalid stack address and is also
- * a signal to the unwinder that it's a pt_regs pointer in disguise.
- *
- * NOTE: This macro must be used *after* PUSH_AND_CLEAR_REGS because it corrupts
- * the original rbp.
- */
-.macro ENCODE_FRAME_POINTER ptregs_offset=0
-#ifdef CONFIG_FRAME_POINTER
-	leaq 1+\ptregs_offset(%rsp), %rbp
-#endif
-.endm
-
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 
 /*
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -246,22 +246,6 @@
 .Lend_\@:
 .endm
 
-/*
- * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
- * frame pointer is replaced with an encoded pointer to pt_regs.  The encoding
- * is just clearing the MSB, which makes it an invalid stack address and is also
- * a signal to the unwinder that it's a pt_regs pointer in disguise.
- *
- * NOTE: This macro must be used *after* SAVE_ALL because it corrupts the
- * original rbp.
- */
-.macro ENCODE_FRAME_POINTER
-#ifdef CONFIG_FRAME_POINTER
-	mov %esp, %ebp
-	andl $0x7fffffff, %ebp
-#endif
-.endm
-
 .macro RESTORE_INT_REGS
 	popl	%ebx
 	popl	%ecx
--- a/arch/x86/include/asm/frame.h
+++ b/arch/x86/include/asm/frame.h
@@ -22,6 +22,35 @@
 	pop %_ASM_BP
 .endm
 
+#ifdef CONFIG_X86_64
+/*
+ * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
+ * frame pointer is replaced with an encoded pointer to pt_regs.  The encoding
+ * is just setting the LSB, which makes it an invalid stack address and is also
+ * a signal to the unwinder that it's a pt_regs pointer in disguise.
+ *
+ * NOTE: This macro must be used *after* PUSH_AND_CLEAR_REGS because it corrupts
+ * the original rbp.
+ */
+.macro ENCODE_FRAME_POINTER ptregs_offset=0
+	leaq 1+\ptregs_offset(%rsp), %rbp
+.endm
+#else /* !CONFIG_X86_64 */
+/*
+ * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
+ * frame pointer is replaced with an encoded pointer to pt_regs.  The encoding
+ * is just clearing the MSB, which makes it an invalid stack address and is also
+ * a signal to the unwinder that it's a pt_regs pointer in disguise.
+ *
+ * NOTE: This macro must be used *after* SAVE_ALL because it corrupts the
+ * original ebp.
+ */
+.macro ENCODE_FRAME_POINTER
+	mov %esp, %ebp
+	andl $0x7fffffff, %ebp
+.endm
+#endif /* CONFIG_X86_64 */
+
 #else /* !__ASSEMBLY__ */
 
 #define FRAME_BEGIN				\
@@ -30,12 +59,32 @@
 
 #define FRAME_END "pop %" _ASM_BP "\n"
 
+#ifdef CONFIG_X86_64
+#define ENCODE_FRAME_POINTER			\
+	"lea 1(%rsp), %rbp\n\t"
+#else /* !CONFIG_X86_64 */
+#define ENCODE_FRAME_POINTER			\
+	"movl %esp, %ebp\n\t"			\
+	"andl $0x7fffffff, %ebp\n\t"
+#endif /* CONFIG_X86_64 */
+
 #endif /* __ASSEMBLY__ */
 
 #define FRAME_OFFSET __ASM_SEL(4, 8)
 
 #else /* !CONFIG_FRAME_POINTER */
 
+#ifdef __ASSEMBLY__
+
+.macro ENCODE_FRAME_POINTER ptregs_offset=0
+.endm
+
+#else /* !__ASSEMBLY */
+
+#define ENCODE_FRAME_POINTER
+
+#endif
+
 #define FRAME_BEGIN
 #define FRAME_END
 #define FRAME_OFFSET 0



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 03/15] x86/kprobes: Fix frame pointer annotations
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
  2019-06-05 13:07 ` [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path Peter Zijlstra
  2019-06-05 13:07 ` [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-07 13:02   ` Masami Hiramatsu
  2019-06-05 13:07 ` [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

The kprobe trampolines have a FRAME_POINTER annotation that makes no
sense. It marks the frame in the middle of pt_regs, at the place of
saving BP.

Change it to mark the pt_regs frame as per the ENCODE_FRAME_POINTER
from the respective entry_*.S.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/kprobes/common.h |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)
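 
For reference, the C-string flavour of ENCODE_FRAME_POINTER that these
trampolines now pull in from <asm/frame.h> (added in the previous patch)
boils down to the following, annotated here for illustration:

#ifdef CONFIG_X86_64
#define ENCODE_FRAME_POINTER			\
	"lea 1(%rsp), %rbp\n\t"		/* rbp = &pt_regs, LSB set */
#else
#define ENCODE_FRAME_POINTER			\
	"movl %esp, %ebp\n\t"			\
	"andl $0x7fffffff, %ebp\n\t"	/* ebp = &pt_regs, MSB cleared */
#endif

Appended after the last push of SAVE_REGS_STRING, the stack pointer is the
base of the just-built pt_regs, so the tagged frame pointer now describes
the whole frame instead of pointing into the middle of it.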

--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -5,15 +5,10 @@
 /* Kprobes and Optprobes common header */
 
 #include <asm/asm.h>
-
-#ifdef CONFIG_FRAME_POINTER
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n" \
-			 "	mov  %" _ASM_SP ", %" _ASM_BP "\n"
-#else
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n"
-#endif
+#include <asm/frame.h>
 
 #ifdef CONFIG_X86_64
+
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax. */		\
 	"	subq $24, %rsp\n"		\
@@ -27,11 +22,13 @@
 	"	pushq %r10\n"			\
 	"	pushq %r11\n"			\
 	"	pushq %rbx\n"			\
-	SAVE_RBP_STRING				\
+	"	pushq %rbp\n"			\
 	"	pushq %r12\n"			\
 	"	pushq %r13\n"			\
 	"	pushq %r14\n"			\
-	"	pushq %r15\n"
+	"	pushq %r15\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popq %r15\n"			\
 	"	popq %r14\n"			\
@@ -51,19 +48,22 @@
 	/* Skip orig_ax, ip, cs */		\
 	"	addq $24, %rsp\n"
 #else
+
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax and gs. */	\
-	"	subl $16, %esp\n"		\
+	"	subl $4*4, %esp\n"		\
 	"	pushl %fs\n"			\
 	"	pushl %es\n"			\
 	"	pushl %ds\n"			\
 	"	pushl %eax\n"			\
-	SAVE_RBP_STRING				\
+	"	pushl %ebp\n"			\
 	"	pushl %edi\n"			\
 	"	pushl %esi\n"			\
 	"	pushl %edx\n"			\
 	"	pushl %ecx\n"			\
-	"	pushl %ebx\n"
+	"	pushl %ebx\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popl %ebx\n"			\
 	"	popl %ecx\n"			\



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-06-05 13:07 ` [PATCH 03/15] x86/kprobes: Fix frame pointer annotations Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-07 14:45   ` Josh Poimboeuf
  2019-06-05 13:07 ` [PATCH 05/15] x86_32: Provide consistent pt_regs Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

When CONFIG_FRAME_POINTER is enabled, we should mark the pt_regs frames
built by the ftrace trampolines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/ftrace_32.S |    3 +++
 arch/x86/kernel/ftrace_64.S |    3 +++
 2 files changed, 6 insertions(+)

--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -9,6 +9,7 @@
 #include <asm/export.h>
 #include <asm/ftrace.h>
 #include <asm/nospec-branch.h>
+#include <asm/frame.h>
 
 # define function_hook	__fentry__
 EXPORT_SYMBOL(__fentry__)
@@ -116,6 +117,8 @@ ENTRY(ftrace_regs_caller)
 	pushl	%ecx
 	pushl	%ebx
 
+	ENCODE_FRAME_POINTER
+
 	movl	12*4(%esp), %eax		/* Load ip (1st parameter) */
 	subl	$MCOUNT_INSN_SIZE, %eax		/* Adjust ip */
 	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -9,6 +9,7 @@
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
 #include <asm/unwind_hints.h>
+#include <asm/frame.h>
 
 	.code64
 	.section .entry.text, "ax"
@@ -203,6 +204,8 @@ GLOBAL(ftrace_regs_caller_op_ptr)
 	leaq MCOUNT_REG_SIZE+8*2(%rsp), %rcx
 	movq %rcx, RSP(%rsp)
 
+	ENCODE_FRAME_POINTER
+
 	/* regs go into 4th parameter */
 	leaq (%rsp), %rcx
 



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 05/15] x86_32: Provide consistent pt_regs
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-06-05 13:07 ` [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-07 13:13   ` Masami Hiramatsu
  2019-06-07 19:32   ` Josh Poimboeuf
  2019-06-05 13:07 ` [PATCH 06/15] x86_32: Allow int3_emulate_push() Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Currently pt_regs on x86_32 has an oddity in that kernel regs
(!user_mode(regs)) are short two entries (esp/ss). This means that any
code trying to use them (typically: regs->sp) needs to jump through
some unfortunate hoops.

Change the entry code to fix this up and create a full pt_regs frame.

This then simplifies various trampolines in ftrace and kprobes, the
stack unwinder, ptrace, kdump and kgdb.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_32.S         |  105 ++++++++++++++++++++++++++++++++++----
 arch/x86/include/asm/kexec.h      |   17 ------
 arch/x86/include/asm/ptrace.h     |   17 ------
 arch/x86/include/asm/stacktrace.h |    2 
 arch/x86/kernel/crash.c           |    8 --
 arch/x86/kernel/ftrace_32.S       |   77 +++++++++++++++------------
 arch/x86/kernel/kgdb.c            |    8 --
 arch/x86/kernel/kprobes/common.h  |    4 -
 arch/x86/kernel/kprobes/core.c    |   29 ++++------
 arch/x86/kernel/kprobes/opt.c     |   20 ++++---
 arch/x86/kernel/process_32.c      |   16 +----
 arch/x86/kernel/ptrace.c          |   29 ----------
 arch/x86/kernel/time.c            |    3 -
 arch/x86/kernel/unwind_frame.c    |   32 +----------
 arch/x86/kernel/unwind_orc.c      |    2 
 15 files changed, 178 insertions(+), 191 deletions(-)
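 
To illustrate the kind of hoop that goes away (simplified sketch; the real
kernel_stack_pointer() helper had a bit more magic for nearly-empty stacks,
and old/new_stack_addr() are made-up names):

/*
 * Old x86_32 rule: kernel-mode pt_regs stopped short of sp/ss, so the
 * interrupted stack started right where the missing sp slot would be.
 */
static unsigned long old_stack_addr(struct pt_regs *regs)
{
	if (!user_mode(regs))
		return (unsigned long)&regs->sp;

	return regs->sp;
}

/* New rule: the entry code always builds a full frame, so this is enough. */
static unsigned long new_stack_addr(struct pt_regs *regs)
{
	return regs->sp;
}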

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -202,9 +202,102 @@
 .Lend_\@:
 .endm
 
+#define CS_FROM_ENTRY_STACK	(1 << 31)
+#define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_KERNEL		(1 << 29)
+
+.macro FIXUP_FRAME
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
+	 * Clear them in case hardware didn't do this for us.
+	 */
+	andl	$0x0000ffff, 3*4(%esp)
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 4*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+#endif
+	testl	$SEGMENT_RPL_MASK, 3*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+
+	orl	$CS_FROM_KERNEL, 3*4(%esp)
+
+	/*
+	 * When we're here from kernel mode; the (exception) stack looks like:
+	 *
+	 *  5*4(%esp) - <previous context>
+	 *  4*4(%esp) - flags
+	 *  3*4(%esp) - cs
+	 *  2*4(%esp) - ip
+	 *  1*4(%esp) - orig_eax
+	 *  0*4(%esp) - gs / function
+	 *
+	 * Let's build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 5 entries as gap:
+	 *
+	 * 12*4(%esp) - <previous context>
+	 * 11*4(%esp) - gap / flags
+	 * 10*4(%esp) - gap / cs
+	 *  9*4(%esp) - gap / ip
+	 *  8*4(%esp) - gap / orig_eax
+	 *  7*4(%esp) - gap / gs / function
+	 *  6*4(%esp) - ss
+	 *  5*4(%esp) - sp
+	 *  4*4(%esp) - flags
+	 *  3*4(%esp) - cs
+	 *  2*4(%esp) - ip
+	 *  1*4(%esp) - orig_eax
+	 *  0*4(%esp) - gs / function
+	 */
+
+	pushl	%ss		# ss
+	pushl	%esp		# sp (points at ss)
+	addl	$6*4, (%esp)	# point sp back at the previous context
+	pushl	6*4(%esp)	# flags
+	pushl	6*4(%esp)	# cs
+	pushl	6*4(%esp)	# ip
+	pushl	6*4(%esp)	# orig_eax
+	pushl	6*4(%esp)	# gs / function
+.Lfrom_usermode_no_fixup_\@:
+.endm
+
+.macro IRET_FRAME
+	testl $CS_FROM_KERNEL, 1*4(%esp)
+	jz .Lfinished_frame_\@
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+.Lfinished_frame_\@:
+.endm
+
 .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
 	cld
 	PUSH_GS
+	FIXUP_FRAME
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
@@ -358,9 +451,6 @@
  * switch to it before we do any copying.
  */
 
-#define CS_FROM_ENTRY_STACK	(1 << 31)
-#define CS_FROM_USER_CR3	(1 << 30)
-
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -374,13 +464,6 @@
 	 * that register for the time this macro runs
 	 */
 
-	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
-	 */
-	andl	$(0x0000ffff), PT_CS(%esp)
-
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
 	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
@@ -990,6 +1073,7 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	IRET_FRAME
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1340,6 +1424,7 @@ END(page_fault)
 
 common_exception:
 	/* the function address is in %gs's slot on the stack */
+	FIXUP_FRAME
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -71,22 +71,6 @@ struct kimage;
 #define KEXEC_BACKUP_SRC_END	(640 * 1024UL - 1)	/* 640K */
 
 /*
- * CPU does not save ss and sp on stack if execution is already
- * running in kernel mode at the time of NMI occurrence. This code
- * fixes it.
- */
-static inline void crash_fixup_ss_esp(struct pt_regs *newregs,
-				      struct pt_regs *oldregs)
-{
-#ifdef CONFIG_X86_32
-	newregs->sp = (unsigned long)&(oldregs->sp);
-	asm volatile("xorl %%eax, %%eax\n\t"
-		     "movw %%ss, %%ax\n\t"
-		     :"=a"(newregs->ss));
-#endif
-}
-
-/*
  * This function is responsible for capturing register states if coming
  * via panic otherwise just fix up the ss and sp if coming via kernel
  * mode exception.
@@ -96,7 +80,6 @@ static inline void crash_setup_regs(stru
 {
 	if (oldregs) {
 		memcpy(newregs, oldregs, sizeof(*newregs));
-		crash_fixup_ss_esp(newregs, oldregs);
 	} else {
 #ifdef CONFIG_X86_32
 		asm volatile("movl %%ebx,%0" : "=m"(newregs->bx));
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struc
 #define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
-#ifdef CONFIG_X86_32
-extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
-#else
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
-#endif
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
@@ -201,14 +197,6 @@ static inline unsigned long regs_get_reg
 	if (unlikely(offset > MAX_REG_OFFSET))
 		return 0;
 #ifdef CONFIG_X86_32
-	/*
-	 * Traps from the kernel do not save sp and ss.
-	 * Use the helper function to retrieve sp.
-	 */
-	if (offset == offsetof(struct pt_regs, sp) &&
-	    regs->cs == __KERNEL_CS)
-		return kernel_stack_pointer(regs);
-
 	/* The selector fields are 16-bit. */
 	if (offset == offsetof(struct pt_regs, cs) ||
 	    offset == offsetof(struct pt_regs, ss) ||
@@ -234,8 +222,7 @@ static inline unsigned long regs_get_reg
 static inline int regs_within_kernel_stack(struct pt_regs *regs,
 					   unsigned long addr)
 {
-	return ((addr & ~(THREAD_SIZE - 1))  ==
-		(kernel_stack_pointer(regs) & ~(THREAD_SIZE - 1)));
+	return ((addr & ~(THREAD_SIZE - 1)) == (regs->sp & ~(THREAD_SIZE - 1)));
 }
 
 /**
@@ -249,7 +236,7 @@ static inline int regs_within_kernel_sta
  */
 static inline unsigned long *regs_get_kernel_stack_nth_addr(struct pt_regs *regs, unsigned int n)
 {
-	unsigned long *addr = (unsigned long *)kernel_stack_pointer(regs);
+	unsigned long *addr = (unsigned long *)regs->sp;
 
 	addr += n;
 	if (regs_within_kernel_stack(regs, (unsigned long)addr))
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -78,7 +78,7 @@ static inline unsigned long *
 get_stack_pointer(struct task_struct *task, struct pt_regs *regs)
 {
 	if (regs)
-		return (unsigned long *)kernel_stack_pointer(regs);
+		return (unsigned long *)regs->sp;
 
 	if (task == current)
 		return __builtin_frame_address(0);
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -72,14 +72,6 @@ static inline void cpu_crash_vmclear_loa
 
 static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 {
-#ifdef CONFIG_X86_32
-	struct pt_regs fixed_regs;
-
-	if (!user_mode(regs)) {
-		crash_fixup_ss_esp(&fixed_regs, regs);
-		regs = &fixed_regs;
-	}
-#endif
 	crash_save_cpu(regs, cpu);
 
 	/*
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -10,6 +10,7 @@
 #include <asm/ftrace.h>
 #include <asm/nospec-branch.h>
 #include <asm/frame.h>
+#include <asm/asm-offsets.h>
 
 # define function_hook	__fentry__
 EXPORT_SYMBOL(__fentry__)
@@ -90,26 +91,38 @@ END(ftrace_caller)
 
 ENTRY(ftrace_regs_caller)
 	/*
-	 * i386 does not save SS and ESP when coming from kernel.
-	 * Instead, to get sp, &regs->sp is used (see ptrace.h).
-	 * Unfortunately, that means eflags must be at the same location
-	 * as the current return ip is. We move the return ip into the
-	 * regs->ip location, and move flags into the return ip location.
+	 * We're here from an mcount/fentry CALL, and the stack frame looks like:
+	 *
+	 *  <previous context>
+	 *  RET-IP
+	 *
+	 * The purpose of this function is to call out in an emulated INT3
+	 * environment with a stack frame like:
+	 *
+	 *  <previous context>
+	 *  gap / RET-IP
+	 *  gap
+	 *  gap
+	 *  gap
+	 *  pt_regs
+	 *
+	 * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds
 	 */
-	pushl	$__KERNEL_CS
-	pushl	4(%esp)				/* Save the return ip */
-	pushl	$0				/* Load 0 into orig_ax */
+	subl	$3*4, %esp	# RET-IP + 3 gaps
+	pushl	%ss		# ss
+	pushl	%esp		# points at ss
+	addl	$5*4, (%esp)	#   make it point at <previous context>
+	pushfl			# flags
+	pushl	$__KERNEL_CS	# cs
+	pushl	7*4(%esp)	# ip <- RET-IP
+	pushl	$0		# orig_eax
+
 	pushl	%gs
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
-	pushl	%eax
-
-	/* Get flags and place them into the return ip slot */
-	pushf
-	popl	%eax
-	movl	%eax, 8*4(%esp)
 
+	pushl	%eax
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
@@ -119,24 +132,25 @@ ENTRY(ftrace_regs_caller)
 
 	ENCODE_FRAME_POINTER
 
-	movl	12*4(%esp), %eax		/* Load ip (1st parameter) */
-	subl	$MCOUNT_INSN_SIZE, %eax		/* Adjust ip */
-	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
-	movl	function_trace_op, %ecx		/* Save ftrace_pos in 3rd parameter */
-	pushl	%esp				/* Save pt_regs as 4th parameter */
+	movl	PT_EIP(%esp), %eax	# 1st argument: IP
+	subl	$MCOUNT_INSN_SIZE, %eax
+	movl	21*4(%esp), %edx	# 2nd argument: parent ip
+	movl	function_trace_op, %ecx	# 3rd argument: ftrace_pos
+	pushl	%esp			# 4th argument: pt_regs
 
 GLOBAL(ftrace_regs_call)
 	call	ftrace_stub
 
-	addl	$4, %esp			/* Skip pt_regs */
+	addl	$4, %esp		# skip 4th argument
 
-	/* restore flags */
-	push	14*4(%esp)
-	popf
-
-	/* Move return ip back to its original location */
-	movl	12*4(%esp), %eax
-	movl	%eax, 14*4(%esp)
+	/* place IP below the new SP */
+	movl	PT_OLDESP(%esp), %eax
+	movl	PT_EIP(%esp), %ecx
+	movl	%ecx, -4(%eax)
+
+	/* place EAX below that */
+	movl	PT_EAX(%esp), %ecx
+	movl	%ecx, -8(%eax)
 
 	popl	%ebx
 	popl	%ecx
@@ -144,14 +158,9 @@ GLOBAL(ftrace_regs_call)
 	popl	%esi
 	popl	%edi
 	popl	%ebp
-	popl	%eax
-	popl	%ds
-	popl	%es
-	popl	%fs
-	popl	%gs
 
-	/* use lea to not affect flags */
-	lea	3*4(%esp), %esp			/* Skip orig_ax, ip and cs */
+	lea	-8(%eax), %esp
+	popl	%eax
 
 	jmp	.Lftrace_ret
 
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -127,14 +127,6 @@ char *dbg_get_reg(int regno, void *mem,
 
 #ifdef CONFIG_X86_32
 	switch (regno) {
-	case GDB_SS:
-		if (!user_mode(regs))
-			*(unsigned long *)mem = __KERNEL_DS;
-		break;
-	case GDB_SP:
-		if (!user_mode(regs))
-			*(unsigned long *)mem = kernel_stack_pointer(regs);
-		break;
 	case GDB_GS:
 	case GDB_FS:
 		*(unsigned long *)mem = 0xFFFF;
--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -72,8 +72,8 @@
 	"	popl %edi\n"			\
 	"	popl %ebp\n"			\
 	"	popl %eax\n"			\
-	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\
-	"	addl $24, %esp\n"
+	/* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\
+	"	addl $7*4, %esp\n"
 #endif
 
 /* Ensure if the instruction can be boostable */
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -69,7 +69,7 @@
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
-#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
+#define stack_addr(regs) ((unsigned long *)regs->sp)
 
 #define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
 	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
@@ -731,29 +731,27 @@ asm(
 	".global kretprobe_trampoline\n"
 	".type kretprobe_trampoline, @function\n"
 	"kretprobe_trampoline:\n"
-#ifdef CONFIG_X86_64
 	/* We don't bother saving the ss register */
+#ifdef CONFIG_X86_64
 	"	pushq %rsp\n"
 	"	pushfq\n"
 	SAVE_REGS_STRING
 	"	movq %rsp, %rdi\n"
 	"	call trampoline_handler\n"
 	/* Replace saved sp with true return address. */
-	"	movq %rax, 152(%rsp)\n"
+	"	movq %rax, 19*8(%rsp)\n"
 	RESTORE_REGS_STRING
 	"	popfq\n"
 #else
-	"	pushf\n"
+	"	pushl %esp\n"
+	"	pushfl\n"
 	SAVE_REGS_STRING
 	"	movl %esp, %eax\n"
 	"	call trampoline_handler\n"
-	/* Move flags to cs */
-	"	movl 56(%esp), %edx\n"
-	"	movl %edx, 52(%esp)\n"
-	/* Replace saved flags with true return address. */
-	"	movl %eax, 56(%esp)\n"
+	/* Replace saved sp with true return address. */
+	"	movl %eax, 15*4(%esp)\n"
 	RESTORE_REGS_STRING
-	"	popf\n"
+	"	popfl\n"
 #endif
 	"	ret\n"
 	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
@@ -794,16 +792,13 @@ __used __visible void *trampoline_handle
 	INIT_HLIST_HEAD(&empty_rp);
 	kretprobe_hash_lock(current, &head, &flags);
 	/* fixup registers */
-#ifdef CONFIG_X86_64
 	regs->cs = __KERNEL_CS;
-	/* On x86-64, we use pt_regs->sp for return address holder. */
-	frame_pointer = &regs->sp;
-#else
-	regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+	regs->cs |= get_kernel_rpl();
 	regs->gs = 0;
-	/* On x86-32, we use pt_regs->flags for return address holder. */
-	frame_pointer = &regs->flags;
 #endif
+	/* We use pt_regs->sp for return address holder. */
+	frame_pointer = &regs->sp;
 	regs->ip = trampoline_address;
 	regs->orig_ax = ~0UL;
 
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -115,14 +115,15 @@ asm (
 			"optprobe_template_call:\n"
 			ASM_NOP5
 			/* Move flags to rsp */
-			"	movq 144(%rsp), %rdx\n"
-			"	movq %rdx, 152(%rsp)\n"
+			"	movq 18*8(%rsp), %rdx\n"
+			"	movq %rdx, 19*8(%rsp)\n"
 			RESTORE_REGS_STRING
 			/* Skip flags entry */
 			"	addq $8, %rsp\n"
 			"	popfq\n"
 #else /* CONFIG_X86_32 */
-			"	pushf\n"
+			"	pushl %esp\n"
+			"	pushfl\n"
 			SAVE_REGS_STRING
 			"	movl %esp, %edx\n"
 			".global optprobe_template_val\n"
@@ -131,9 +132,13 @@ asm (
 			".global optprobe_template_call\n"
 			"optprobe_template_call:\n"
 			ASM_NOP5
+			/* Move flags into esp */
+			"	movl 14*4(%esp), %edx\n"
+			"	movl %edx, 15*4(%esp)\n"
 			RESTORE_REGS_STRING
-			"	addl $4, %esp\n"	/* skip cs */
-			"	popf\n"
+			/* Skip flags entry */
+			"	addl $4, %esp\n"
+			"	popfl\n"
 #endif
 			".global optprobe_template_end\n"
 			"optprobe_template_end:\n"
@@ -165,10 +170,9 @@ optimized_callback(struct optimized_kpro
 	} else {
 		struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 		/* Save skipped registers */
-#ifdef CONFIG_X86_64
 		regs->cs = __KERNEL_CS;
-#else
-		regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+		regs->cs |= get_kernel_rpl();
 		regs->gs = 0;
 #endif
 		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -62,27 +62,21 @@ void __show_regs(struct pt_regs *regs, e
 {
 	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
 	unsigned long d0, d1, d2, d3, d6, d7;
-	unsigned long sp;
-	unsigned short ss, gs;
+	unsigned short gs;
 
-	if (user_mode(regs)) {
-		sp = regs->sp;
-		ss = regs->ss;
+	if (user_mode(regs))
 		gs = get_user_gs(regs);
-	} else {
-		sp = kernel_stack_pointer(regs);
-		savesegment(ss, ss);
+	else
 		savesegment(gs, gs);
-	}
 
 	show_ip(regs, KERN_DEFAULT);
 
 	printk(KERN_DEFAULT "EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
 		regs->ax, regs->bx, regs->cx, regs->dx);
 	printk(KERN_DEFAULT "ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n",
-		regs->si, regs->di, regs->bp, sp);
+		regs->si, regs->di, regs->bp, regs->sp);
 	printk(KERN_DEFAULT "DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x EFLAGS: %08lx\n",
-	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss, regs->flags);
+	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, regs->ss, regs->flags);
 
 	if (mode != SHOW_REGS_ALL)
 		return;
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -153,35 +153,6 @@ static inline bool invalid_selector(u16
 
 #define FLAG_MASK		FLAG_MASK_32
 
-/*
- * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  The previous stack will be directly underneath the saved
- * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
- *
- * Now, if the stack is empty, '&regs->sp' is out of range. In this
- * case we try to take the previous stack. To always return a non-null
- * stack pointer we fall back to regs as stack if no previous stack
- * exists.
- *
- * This is valid only for kernel mode traps.
- */
-unsigned long kernel_stack_pointer(struct pt_regs *regs)
-{
-	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
-	unsigned long sp = (unsigned long)&regs->sp;
-	u32 *prev_esp;
-
-	if (context == (sp & ~(THREAD_SIZE - 1)))
-		return sp;
-
-	prev_esp = (u32 *)(context);
-	if (*prev_esp)
-		return (unsigned long)*prev_esp;
-
-	return (unsigned long)regs;
-}
-EXPORT_SYMBOL_GPL(kernel_stack_pointer);
-
 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
 {
 	BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -37,8 +37,7 @@ unsigned long profile_pc(struct pt_regs
 #ifdef CONFIG_FRAME_POINTER
 		return *(unsigned long *)(regs->bp + sizeof(long));
 #else
-		unsigned long *sp =
-			(unsigned long *)kernel_stack_pointer(regs);
+		unsigned long *sp = (unsigned long *)regs->sp;
 		/*
 		 * Return address is either directly at stack pointer
 		 * or above a saved flags. Eflags has bits 22-31 zero,
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -69,15 +69,6 @@ static void unwind_dump(struct unwind_st
 	}
 }
 
-static size_t regs_size(struct pt_regs *regs)
-{
-	/* x86_32 regs from kernel mode are two words shorter: */
-	if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs))
-		return sizeof(*regs) - 2*sizeof(long);
-
-	return sizeof(*regs);
-}
-
 static bool in_entry_code(unsigned long ip)
 {
 	char *addr = (char *)ip;
@@ -197,12 +188,6 @@ static struct pt_regs *decode_frame_poin
 }
 #endif
 
-#ifdef CONFIG_X86_32
-#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long))
-#else
-#define KERNEL_REGS_SIZE (sizeof(struct pt_regs))
-#endif
-
 static bool update_stack_state(struct unwind_state *state,
 			       unsigned long *next_bp)
 {
@@ -213,7 +198,7 @@ static bool update_stack_state(struct un
 	size_t len;
 
 	if (state->regs)
-		prev_frame_end = (void *)state->regs + regs_size(state->regs);
+		prev_frame_end = (void *)state->regs + sizeof(*state->regs);
 	else
 		prev_frame_end = (void *)state->bp + FRAME_HEADER_SIZE;
 
@@ -221,7 +206,7 @@ static bool update_stack_state(struct un
 	regs = decode_frame_pointer(next_bp);
 	if (regs) {
 		frame = (unsigned long *)regs;
-		len = KERNEL_REGS_SIZE;
+		len = sizeof(*regs);
 		state->got_irq = true;
 	} else {
 		frame = next_bp;
@@ -245,14 +230,6 @@ static bool update_stack_state(struct un
 	    frame < prev_frame_end)
 		return false;
 
-	/*
-	 * On 32-bit with user mode regs, make sure the last two regs are safe
-	 * to access:
-	 */
-	if (IS_ENABLED(CONFIG_X86_32) && regs && user_mode(regs) &&
-	    !on_stack(info, frame, len + 2*sizeof(long)))
-		return false;
-
 	/* Move state to the next frame: */
 	if (regs) {
 		state->regs = regs;
@@ -411,10 +388,9 @@ void __unwind_start(struct unwind_state
 	 * Pretend that the frame is complete and that BP points to it, but save
 	 * the real BP so that we can use it when looking for the next frame.
 	 */
-	if (regs && regs->ip == 0 &&
-	    (unsigned long *)kernel_stack_pointer(regs) >= first_frame) {
+	if (regs && regs->ip == 0 && (unsigned long *)regs->sp >= first_frame) {
 		state->next_bp = bp;
-		bp = ((unsigned long *)kernel_stack_pointer(regs)) - 1;
+		bp = ((unsigned long *)regs->sp) - 1;
 	}
 
 	/* Initialize stack info and make sure the frame data is accessible: */
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -579,7 +579,7 @@ void __unwind_start(struct unwind_state
 			goto done;
 
 		state->ip = regs->ip;
-		state->sp = kernel_stack_pointer(regs);
+		state->sp = regs->sp;
 		state->bp = regs->bp;
 		state->regs = regs;
 		state->full_regs = true;



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 06/15] x86_32: Allow int3_emulate_push()
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-06-05 13:07 ` [PATCH 05/15] x86_32: Provide consistent pt_regs Peter Zijlstra
@ 2019-06-05 13:07 ` Peter Zijlstra
  2019-06-05 13:08 ` [PATCH 07/15] x86: Add int3_emulate_call() selftest Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:07 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Now that x86_32 has an unconditional gap on the kernel stack frame,
the int3_emulate_push() thing will work without further changes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/text-patching.h |    2 --
 arch/x86/kernel/ftrace.c             |    7 -------
 2 files changed, 9 deletions(-)
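 
For context, the emulation helpers this enables are roughly the following
(sketch, abridged from asm/text-patching.h):

static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
	/*
	 * Grows the interrupted context's stack; only safe because the
	 * entry code now guarantees a gap below the INT3 IRET frame on
	 * both 32 and 64 bit.
	 */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* Push the emulated CALL's return address, then "jump" to func. */
	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	int3_emulate_jmp(regs, func);
}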

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -51,7 +51,6 @@ static inline void int3_emulate_jmp(stru
 #define INT3_INSN_SIZE 1
 #define CALL_INSN_SIZE 5
 
-#ifdef CONFIG_X86_64
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
 	/*
@@ -69,7 +68,6 @@ static inline void int3_emulate_call(str
 	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
 	int3_emulate_jmp(regs, func);
 }
-#endif /* CONFIG_X86_64 */
 #endif /* !CONFIG_UML_X86 */
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -300,7 +300,6 @@ int ftrace_int3_handler(struct pt_regs *
 
 	ip = regs->ip - INT3_INSN_SIZE;
 
-#ifdef CONFIG_X86_64
 	if (ftrace_location(ip)) {
 		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
 		return 1;
@@ -312,12 +311,6 @@ int ftrace_int3_handler(struct pt_regs *
 		int3_emulate_call(regs, ftrace_update_func_call);
 		return 1;
 	}
-#else
-	if (ftrace_location(ip) || is_ftrace_caller(ip)) {
-		int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
-		return 1;
-	}
-#endif
 
 	return 0;
 }



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 07/15] x86: Add int3_emulate_call() selftest
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-06-05 13:07 ` [PATCH 06/15] x86_32: Allow int3_emulate_push() Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-10 16:52   ` Josh Poimboeuf
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

Given that the entry_*.S changes for this functionality are somewhat
tricky, make sure the paths are tested every boot, instead of on the
rare occasion when we trip an INT3 while rewriting text.

Requested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/alternative.c |   81 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -614,11 +614,83 @@ extern struct paravirt_patch_site __star
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
+static void __init int3_magic(unsigned int *ptr)
+{
+	*ptr = 1;
+}
+
+extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip)
+		return NOTIFY_DONE;
+
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_selftest_ip,
+	 * then trigger the INT3, padded with NOPs to match a CALL instruction
+	 * length.
+	 */
+	asm volatile ("1: int3; nop; nop; nop; nop\n\t"
+		      ".pushsection .init.data,\"aw\"\n\t"
+		      ".align " __ASM_SEL(4, 8) "\n\t"
+		      ".type int3_selftest_ip, @object\n\t"
+		      ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t"
+		      "int3_selftest_ip:\n\t"
+		      __ASM_SEL(.long, .quad) " 1b\n\t"
+		      ".popsection\n\t"
+		      : : __ASM_SEL_RAW(a, D) (&val) : "memory");
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -643,10 +715,11 @@ void __init alternative_instructions(voi
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 07/15] x86: Add int3_emulate_call() selftest Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-07  5:41   ` Nadav Amit
                     ` (3 more replies)
  2019-06-05 13:08 ` [PATCH 09/15] compiler.h: Make __ADDRESSABLE() symbol truly unique Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 4 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

In preparation for static_call support, teach text_poke_bp() to
emulate instructions, including CALL.

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe(), which restores random text; for that site
we have to build an explicit instruction to emulate.

Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/text-patching.h |    2 -
 arch/x86/kernel/alternative.c        |   47 ++++++++++++++++++++++++++---------
 arch/x86/kernel/jump_label.c         |    3 --
 arch/x86/kernel/kprobes/opt.c        |   11 +++++---
 4 files changed, 46 insertions(+), 17 deletions(-)
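 
Usage sketch for the changed interface (hypothetical caller; patch_jump(),
addr and target are made-up names):

static void patch_jump(void *addr, void *target)
{
	u8 insn[5];
	s32 rel = (s32)((long)target - ((long)addr + 5));

	insn[0] = 0xE9;				/* JMP rel32 */
	memcpy(&insn[1], &rel, sizeof(rel));

	/*
	 * Common case: pass NULL and poke_int3_handler() emulates the
	 * very instruction being written while the INT3 is in place.
	 */
	text_poke_bp(addr, insn, sizeof(insn), NULL);
}

The odd one out is arch_unoptimize_kprobe() below, which writes bytes that
are not a single executable instruction (INT3 plus the tail of the original
text) and therefore builds a separate 5-byte JMP for the @emulate argument.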

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -921,19 +921,25 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static const void *bp_int3_opcode, *bp_int3_addr;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
+	struct opcode {
+		u8 insn;
+		s32 rel;
+	} __packed opcode;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching_in_progress.
 	 *
-	 * 	in_progress = TRUE		INT3
-	 * 	WMB				RMB
-	 * 	write INT3			if (in_progress)
+	 *	in_progress = TRUE		INT3
+	 *	WMB				RMB
+	 *	write INT3			if (in_progress)
 	 *
-	 * Idem for bp_int3_handler.
+	 * Idem for bp_int3_opcode.
 	 */
 	smp_rmb();
 
@@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	opcode = *(struct opcode *)bp_int3_opcode;
+
+	switch (opcode.insn) {
+	case 0xE8: /* CALL */
+		int3_emulate_call(regs, ip + opcode.rel);
+		break;
+
+	case 0xE9: /* JMP */
+		int3_emulate_jmp(regs, ip + opcode.rel);
+		break;
+
+	default: /* assume NOP */
+		int3_emulate_jmp(regs, ip);
+		break;
+	}
 
 	return 1;
 }
@@ -955,7 +974,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  * @addr:	address to patch
  * @opcode:	opcode of new instruction
  * @len:	length to copy
- * @handler:	address to jump to when the temporary breakpoint is hit
+ * @emulate:	opcode to emulate, when NULL use @opcode
  *
  * Modify multi-byte instruction by using int3 breakpoint on SMP.
  * We completely avoid stop_machine() here, and achieve the
@@ -970,19 +989,25 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  *	  replacing opcode
  *	- sync cores
  */
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
 {
 	unsigned char int3 = 0xcc;
 
-	bp_int3_handler = handler;
+	bp_int3_opcode = emulate ?: opcode;
 	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	bp_patching_in_progress = true;
 
 	lockdep_assert_held(&text_mutex);
 
 	/*
+	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
+	 * notably a JMP, CALL or NOP5_ATOMIC.
+	 */
+	BUG_ON(len != 5);
+
+	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
-	 * in_progress and handler are correctly ordered wrt. patching.
+	 * in_progress and opcode are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -87,8 +87,7 @@ static void __ref __jump_label_transform
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
-		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
 }
 
 void arch_jump_label_transform(struct jump_entry *entry,
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
 		insn_buff[0] = RELATIVEJUMP_OPCODE;
 		*(s32 *)(&insn_buff[1]) = rel;
 
-		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-			     op->optinsn.insn);
+		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);
 
 		list_del_init(&op->list);
 	}
@@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
 void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	u8 insn_buff[RELATIVEJUMP_SIZE];
+	u8 emulate_buff[RELATIVEJUMP_SIZE];
 
 	/* Set int3 to first byte for kprobes */
 	insn_buff[0] = BREAKPOINT_INSTRUCTION;
 	memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+	emulate_buff[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
+			((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
 	text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-		     op->optinsn.insn);
+		     emulate_buff);
 }
 
 /*



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 09/15] compiler.h: Make __ADDRESSABLE() symbol truly unique
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-05 13:08 ` [PATCH 10/15] static_call: Add basic static call infrastructure Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

From: Josh Poimboeuf <jpoimboe@redhat.com>

The __ADDRESSABLE() macro uses the __LINE__ macro to create a temporary
symbol which has a unique name.  However, if the macro is used multiple
times from within another macro, the line number will always be the
same, resulting in duplicate symbols.

Make the temporary symbols truly unique by using __UNIQUE_ID instead of
__LINE__.

Cc: Julia Cartwright <julia@ni.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: x86@kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Nadav Amit <namit@vmware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8bc857824f82462a296a8a3c4913a11a7f801e74.1547073843.git.jpoimboe@redhat.com
---
 include/linux/compiler.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
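 
A contrived way to hit the problem (KEEP_TWICE() and some_func() are
made-up names):

/* A wrapper macro that ends up expanding __ADDRESSABLE() twice. */
#define KEEP_TWICE(sym)		\
	__ADDRESSABLE(sym)	\
	__ADDRESSABLE(sym)

extern int some_func(void);

/*
 * With the __LINE__ suffix both expansions define the exact same
 * __addressable_some_func<N> symbol (same sym, same line) and the build
 * fails with a redefinition error; __UNIQUE_ID() uses __COUNTER__, which
 * differs per expansion, so the two definitions get distinct names.
 */
KEEP_TWICE(some_func)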

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -294,7 +294,7 @@ unsigned long read_word_at_a_time(const
  */
 #define __ADDRESSABLE(sym) \
 	static void * __section(".discard.addressable") __used \
-		__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+		__UNIQUE_ID(__addressable_##sym) = (void *)&sym;
 
 /**
  * offset_to_ptr - convert a relative memory offset to an absolute pointer



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (8 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 09/15] compiler.h: Make __ADDRESSABLE() symbol truly unique Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-06 22:44   ` Nadav Amit
  2019-06-05 13:08 ` [PATCH 11/15] static_call: Add inline " Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

From: Josh Poimboeuf <jpoimboe@redhat.com>

Static calls are a replacement for global function pointers.  They use
code patching to allow direct calls to be used instead of indirect
calls.  They give the flexibility of function pointers, but with
improved performance.  This is especially important for cases where
retpolines would otherwise be used, as retpolines can significantly
impact performance.

The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
  https://lkml.kernel.org/r/20181006015110.653946300@goodmis.org

There are two implementations, depending on arch support:

 1) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
 2) basic function pointers

For more details, see the comments in include/linux/static_call.h.

Cc: x86@kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Julia Cartwright <julia@ni.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a01f733889ebf4bc447507ab8041a60378eaa89f.1547073843.git.jpoimboe@redhat.com
---
 arch/Kconfig                      |    3 
 include/linux/static_call.h       |  135 ++++++++++++++++++++++++++++++++++++++
 include/linux/static_call_types.h |   13 +++
 3 files changed, 151 insertions(+)
 create mode 100644 include/linux/static_call.h
 create mode 100644 include/linux/static_call_types.h
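 
A slightly more concrete usage sketch than the header comment (crc_generic,
crc_pclmul, my_crc, compute_crc and crc_init are made-up names; the CPU
feature test is only an example):

static int crc_generic(const void *data, int len) { return 0; /* ... */ }
static int crc_pclmul(const void *data, int len)  { return 0; /* ... */ }

/* 'my_crc' starts out routed to crc_generic(). */
DEFINE_STATIC_CALL(my_crc, crc_generic);

int compute_crc(const void *data, int len)
{
	/*
	 * Direct call through the patched trampoline, or a plain
	 * indirect call on architectures without HAVE_STATIC_CALL.
	 */
	return static_call(my_crc, data, len);
}

void __init crc_init(void)
{
	/* Retarget every static_call(my_crc, ...) site at once. */
	if (boot_cpu_has(X86_FEATURE_PCLMULQDQ))
		static_call_update(my_crc, crc_pclmul);
}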

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -927,6 +927,9 @@ config LOCK_EVENT_COUNTS
 	  the chance of application behavior change because of timing
 	  differences. The counts are reported via debugfs.
 
+config HAVE_STATIC_CALL
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
--- /dev/null
+++ b/include/linux/static_call.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_STATIC_CALL_H
+#define _LINUX_STATIC_CALL_H
+
+/*
+ * Static call support
+ *
+ * Static calls use code patching to hard-code function pointers into direct
+ * branch instructions.  They give the flexibility of function pointers, but
+ * with improved performance.  This is especially important for cases where
+ * retpolines would otherwise be used, as retpolines can significantly impact
+ * performance.
+ *
+ *
+ * API overview:
+ *
+ *   DECLARE_STATIC_CALL(key, func);
+ *   DEFINE_STATIC_CALL(key, func);
+ *   static_call(key, args...);
+ *   static_call_update(key, func);
+ *
+ *
+ * Usage example:
+ *
+ *   # Start with the following functions (with identical prototypes):
+ *   int func_a(int arg1, int arg2);
+ *   int func_b(int arg1, int arg2);
+ *
+ *   # Define a 'my_key' reference, associated with func_a() by default
+ *   DEFINE_STATIC_CALL(my_key, func_a);
+ *
+ *   # Call func_a()
+ *   static_call(my_key, arg1, arg2);
+ *
+ *   # Update 'my_key' to point to func_b()
+ *   static_call_update(my_key, func_b);
+ *
+ *   # Call func_b()
+ *   static_call(my_key, arg1, arg2);
+ *
+ *
+ * Implementation details:
+ *
+ *    This requires some arch-specific code (CONFIG_HAVE_STATIC_CALL).
+ *    Otherwise basic indirect calls are used (with function pointers).
+ *
+ *    Each static_call() site calls into a trampoline associated with the key.
+ *    The trampoline has a direct branch to the default function.  Updates to a
+ *    key will modify the trampoline's branch destination.
+ */
+
+#include <linux/types.h>
+#include <linux/cpu.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_HAVE_STATIC_CALL
+#include <asm/static_call.h>
+extern void arch_static_call_transform(void *site, void *tramp, void *func);
+#endif
+
+
+#define DECLARE_STATIC_CALL(key, func)					\
+	extern struct static_call_key key;				\
+	extern typeof(func) STATIC_CALL_TRAMP(key)
+
+
+#if defined(CONFIG_HAVE_STATIC_CALL)
+
+struct static_call_key {
+	void *func, *tramp;
+};
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+		.tramp = STATIC_CALL_TRAMP(key),			\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
+
+#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
+
+static inline void __static_call_update(struct static_call_key *key, void *func)
+{
+	cpus_read_lock();
+	WRITE_ONCE(key->func, func);
+	arch_static_call_transform(NULL, key->tramp, func);
+	cpus_read_unlock();
+}
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key)						\
+	EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key)					\
+	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#else /* Generic implementation */
+
+struct static_call_key {
+	void *func;
+};
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+	}
+
+#define static_call(key, args...)					\
+	((typeof(STATIC_CALL_TRAMP(key))*)(key.func))(args)
+
+static inline void __static_call_update(struct static_call_key *key, void *func)
+{
+	WRITE_ONCE(key->func, func);
+}
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
+#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
+
+#endif /* CONFIG_HAVE_STATIC_CALL */
+
+#endif /* _LINUX_STATIC_CALL_H */
--- /dev/null
+++ b/include/linux/static_call_types.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+#endif /* _STATIC_CALL_TYPES_H */



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (9 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 10/15] static_call: Add basic static call infrastructure Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-06 22:24   ` Nadav Amit
  2019-06-05 13:08 ` [PATCH 12/15] x86/static_call: Add out-of-line static call implementation Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

From: Josh Poimboeuf <jpoimboe@redhat.com>

Add infrastructure for an arch-specific CONFIG_HAVE_STATIC_CALL_INLINE
option, which is a faster version of CONFIG_HAVE_STATIC_CALL.  At
runtime, the static call sites are patched directly, rather than using
the out-of-line trampolines.

Compared to out-of-line static calls, the performance benefits are more
modest, but still measurable.  Steven Rostedt did some tracepoint
measurements:

  https://lkml.kernel.org/r/20181126155405.72b4f718@gandalf.local.home

This code is heavily inspired by the jump label code (aka "static
jumps"), as some of the concepts are very similar.

For more details, see the comments in include/linux/static_call.h.
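
One structural detail worth spelling out before the diff: each entry in the
new .static_call_sites section stores two 32-bit self-relative offsets rather
than absolute pointers, which keeps the table compact and position
independent.  A user-space sketch of the decode step done by
static_call_addr()/static_call_key() below (names are illustrative, not the
kernel's; the low bit of the key offset additionally doubles as an init
marker, more on that after the kernel/static_call.c hunk):

  #include <stdint.h>
  #include <stdio.h>

  /* Mirrors struct static_call_site: both fields are self-relative. */
  struct site {
  	int32_t addr;	/* call site address, relative to &site->addr */
  	int32_t key;	/* key address, relative to &site->key */
  };

  static char fake_call_site[5];	/* stands in for a call instruction */
  static long fake_key;			/* stands in for a static_call_key */
  static struct site s;			/* stands in for a table entry */

  static void *decode(const int32_t *field)
  {
  	/* pointer = address of the field + the stored relative offset */
  	return (void *)((intptr_t)*field + (intptr_t)field);
  }

  int main(void)
  {
  	/* What the PC-relative relocations end up storing: */
  	s.addr = (int32_t)((intptr_t)fake_call_site - (intptr_t)&s.addr);
  	s.key  = (int32_t)((intptr_t)&fake_key - (intptr_t)&s.key);

  	printf("site %p (expect %p)\n", decode(&s.addr), (void *)fake_call_site);
  	printf("key  %p (expect %p)\n", decode(&s.key),  (void *)&fake_key);
  	return 0;
  }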

Cc: x86@kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Julia Cartwright <julia@ni.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c70ea8c00b93dadcb97b9d83659cf204121372d6.1547073843.git.jpoimboe@redhat.com
---
 arch/Kconfig                      |    4 
 include/asm-generic/vmlinux.lds.h |    7 
 include/linux/module.h            |   10 +
 include/linux/static_call.h       |   63 +++++++
 include/linux/static_call_types.h |    9 +
 kernel/Makefile                   |    1 
 kernel/module.c                   |    5 
 kernel/static_call.c              |  316 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 414 insertions(+), 1 deletion(-)
 create mode 100644 kernel/static_call.c

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -930,6 +930,10 @@ config LOCK_EVENT_COUNTS
 config HAVE_STATIC_CALL
 	bool
 
+config HAVE_STATIC_CALL_INLINE
+	bool
+	depends on HAVE_STATIC_CALL
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -311,6 +311,12 @@
 	KEEP(*(__jump_table))						\
 	__stop___jump_table = .;
 
+#define STATIC_CALL_DATA						\
+	. = ALIGN(8);							\
+	__start_static_call_sites = .;					\
+	KEEP(*(.static_call_sites))					\
+	__stop_static_call_sites = .;
+
 /*
  * Allow architectures to handle ro_after_init data on their
  * own by defining an empty RO_AFTER_INIT_DATA.
@@ -320,6 +326,7 @@
 	__start_ro_after_init = .;					\
 	*(.data..ro_after_init)						\
 	JUMP_TABLE_DATA							\
+	STATIC_CALL_DATA						\
 	__end_ro_after_init = .;
 #endif
 
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -21,6 +21,7 @@
 #include <linux/rbtree_latch.h>
 #include <linux/error-injection.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/static_call_types.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -472,6 +473,10 @@ struct module {
 	unsigned int num_ftrace_callsites;
 	unsigned long *ftrace_callsites;
 #endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	int num_static_call_sites;
+	struct static_call_site *static_call_sites;
+#endif
 
 #ifdef CONFIG_LIVEPATCH
 	bool klp; /* Is this a livepatch module? */
@@ -728,6 +733,11 @@ static inline bool within_module(unsigne
 {
 	return false;
 }
+
+static inline bool within_module_init(unsigned long addr, const struct module *mod)
+{
+	return false;
+}
 
 /* Get/put a kernel symbol (calls should be symmetric) */
 #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -47,6 +47,12 @@
  *    Each static_call() site calls into a trampoline associated with the key.
  *    The trampoline has a direct branch to the default function.  Updates to a
  *    key will modify the trampoline's branch destination.
+ *
+ *    If the arch has CONFIG_HAVE_STATIC_CALL_INLINE, then the call sites
+ *    themselves will be patched at runtime to call the functions directly,
+ *    rather than calling through the trampoline.  This requires objtool or a
+ *    compiler plugin to detect all the static_call() sites and annotate them
+ *    in the .static_call_sites section.
  */
 
 #include <linux/types.h>
@@ -64,7 +70,62 @@ extern void arch_static_call_transform(v
 	extern typeof(func) STATIC_CALL_TRAMP(key)
 
 
-#if defined(CONFIG_HAVE_STATIC_CALL)
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+struct static_call_key {
+	void *func, *tramp;
+	/*
+	 * List of modules (including vmlinux) and their call sites associated
+	 * with this key.
+	 */
+	struct list_head site_mods;
+};
+
+struct static_call_mod {
+	struct list_head list;
+	struct module *mod; /* for vmlinux, mod == NULL */
+	struct static_call_site *sites;
+};
+
+extern void __static_call_update(struct static_call_key *key, void *func);
+extern int static_call_mod_init(struct module *mod);
+
+#define DEFINE_STATIC_CALL(key, _func)					\
+	DECLARE_STATIC_CALL(key, _func);				\
+	struct static_call_key key = {					\
+		.func = _func,						\
+		.tramp = STATIC_CALL_TRAMP(key),			\
+		.site_mods = LIST_HEAD_INIT(key.site_mods),		\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
+
+/*
+ * __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
+ * the symbol table so objtool can reference it when it generates the
+ * static_call_site structs.
+ */
+#define static_call(key, args...)					\
+({									\
+	__ADDRESSABLE(key);						\
+	STATIC_CALL_TRAMP(key)(args);					\
+})
+
+#define static_call_update(key, func)					\
+({									\
+	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
+	__static_call_update(&key, func);				\
+})
+
+#define EXPORT_STATIC_CALL(key)						\
+	EXPORT_SYMBOL(key);						\
+	EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key)					\
+	EXPORT_SYMBOL_GPL(key);						\
+	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#elif defined(CONFIG_HAVE_STATIC_CALL)
 
 struct static_call_key {
 	void *func, *tramp;
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -10,4 +10,13 @@
 #define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
 #define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
 
+/*
+ * The static call site table needs to be created by external tooling (objtool
+ * or a compiler plugin).
+ */
+struct static_call_site {
+	s32 addr;
+	s32 key;
+};
+
 #endif /* _STATIC_CALL_TYPES_H */
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3117,6 +3117,11 @@ static int find_module_sections(struct m
 					    sizeof(*mod->ei_funcs),
 					    &mod->num_ei_funcs);
 #endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	mod->static_call_sites = section_objs(info, ".static_call_sites",
+					      sizeof(*mod->static_call_sites),
+					      &mod->num_static_call_sites);
+#endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
 
--- /dev/null
+++ b/kernel/static_call.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/init.h>
+#include <linux/static_call.h>
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/processor.h>
+#include <asm/sections.h>
+
+extern struct static_call_site __start_static_call_sites[],
+			       __stop_static_call_sites[];
+
+static bool static_call_initialized;
+
+#define STATIC_CALL_INIT 1UL
+
+/* mutex to protect key modules/sites */
+static DEFINE_MUTEX(static_call_mutex);
+
+static void static_call_lock(void)
+{
+	mutex_lock(&static_call_mutex);
+}
+
+static void static_call_unlock(void)
+{
+	mutex_unlock(&static_call_mutex);
+}
+
+static inline void *static_call_addr(struct static_call_site *site)
+{
+	return (void *)((long)site->addr + (long)&site->addr);
+}
+
+
+static inline struct static_call_key *static_call_key(const struct static_call_site *site)
+{
+	return (struct static_call_key *)
+		(((long)site->key + (long)&site->key) & ~STATIC_CALL_INIT);
+}
+
+/* These assume the key is word-aligned. */
+static inline bool static_call_is_init(struct static_call_site *site)
+{
+	return ((long)site->key + (long)&site->key) & STATIC_CALL_INIT;
+}
+
+static inline void static_call_set_init(struct static_call_site *site)
+{
+	site->key = ((long)static_call_key(site) | STATIC_CALL_INIT) -
+		    (long)&site->key;
+}
+
+static int static_call_site_cmp(const void *_a, const void *_b)
+{
+	const struct static_call_site *a = _a;
+	const struct static_call_site *b = _b;
+	const struct static_call_key *key_a = static_call_key(a);
+	const struct static_call_key *key_b = static_call_key(b);
+
+	if (key_a < key_b)
+		return -1;
+
+	if (key_a > key_b)
+		return 1;
+
+	return 0;
+}
+
+static void static_call_site_swap(void *_a, void *_b, int size)
+{
+	long delta = (unsigned long)_a - (unsigned long)_b;
+	struct static_call_site *a = _a;
+	struct static_call_site *b = _b;
+	struct static_call_site tmp = *a;
+
+	a->addr = b->addr  - delta;
+	a->key  = b->key   - delta;
+
+	b->addr = tmp.addr + delta;
+	b->key  = tmp.key  + delta;
+}
+
+static inline void static_call_sort_entries(struct static_call_site *start,
+					    struct static_call_site *stop)
+{
+	sort(start, stop - start, sizeof(struct static_call_site),
+	     static_call_site_cmp, static_call_site_swap);
+}
+
+void __static_call_update(struct static_call_key *key, void *func)
+{
+	struct static_call_mod *site_mod;
+	struct static_call_site *site, *stop;
+
+	cpus_read_lock();
+	static_call_lock();
+
+	if (key->func == func)
+		goto done;
+
+	key->func = func;
+
+	/*
+	 * If called before init, leave the call sites unpatched for now.
+	 * In the meantime they'll continue to call the temporary trampoline.
+	 */
+	if (!static_call_initialized)
+		goto done;
+
+	list_for_each_entry(site_mod, &key->site_mods, list) {
+		if (!site_mod->sites) {
+			/*
+			 * This can happen if the static call key is defined in
+			 * a module which doesn't use it.
+			 */
+			continue;
+		}
+
+		stop = __stop_static_call_sites;
+
+#ifdef CONFIG_MODULES
+		if (site_mod->mod) {
+			stop = site_mod->mod->static_call_sites +
+			       site_mod->mod->num_static_call_sites;
+		}
+#endif
+
+		for (site = site_mod->sites;
+		     site < stop && static_call_key(site) == key; site++) {
+			void *site_addr = static_call_addr(site);
+			struct module *mod = site_mod->mod;
+
+			if (static_call_is_init(site)) {
+				/*
+				 * Don't write to call sites which were in
+				 * initmem and have since been freed.
+				 */
+				if (!mod && system_state >= SYSTEM_RUNNING)
+					continue;
+				if (mod && !within_module_init((unsigned long)site_addr, mod))
+					continue;
+			}
+
+			if (!kernel_text_address((unsigned long)site_addr)) {
+				WARN_ONCE(1, "can't patch static call site at %pS",
+					  site_addr);
+				continue;
+			}
+
+			arch_static_call_transform(site_addr, key->tramp, func);
+		}
+	}
+
+done:
+	static_call_unlock();
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_GPL(__static_call_update);
+
+#ifdef CONFIG_MODULES
+
+static int static_call_add_module(struct module *mod)
+{
+	struct static_call_site *start = mod->static_call_sites;
+	struct static_call_site *stop = mod->static_call_sites +
+					mod->num_static_call_sites;
+	struct static_call_site *site;
+	struct static_call_key *key, *prev_key = NULL;
+	struct static_call_mod *site_mod;
+
+	if (start == stop)
+		return 0;
+
+	static_call_sort_entries(start, stop);
+
+	for (site = start; site < stop; site++) {
+		void *site_addr = static_call_addr(site);
+
+		if (within_module_init((unsigned long)site_addr, mod))
+			static_call_set_init(site);
+
+		key = static_call_key(site);
+		if (key != prev_key) {
+			prev_key = key;
+
+			site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+			if (!site_mod)
+				return -ENOMEM;
+
+			site_mod->mod = mod;
+			site_mod->sites = site;
+			list_add_tail(&site_mod->list, &key->site_mods);
+		}
+
+		arch_static_call_transform(site_addr, key->tramp, key->func);
+	}
+
+	return 0;
+}
+
+static void static_call_del_module(struct module *mod)
+{
+	struct static_call_site *start = mod->static_call_sites;
+	struct static_call_site *stop = mod->static_call_sites +
+					mod->num_static_call_sites;
+	struct static_call_site *site;
+	struct static_call_key *key, *prev_key = NULL;
+	struct static_call_mod *site_mod;
+
+	for (site = start; site < stop; site++) {
+		key = static_call_key(site);
+		if (key == prev_key)
+			continue;
+		prev_key = key;
+
+		list_for_each_entry(site_mod, &key->site_mods, list) {
+			if (site_mod->mod == mod) {
+				list_del(&site_mod->list);
+				kfree(site_mod);
+				break;
+			}
+		}
+	}
+}
+
+static int static_call_module_notify(struct notifier_block *nb,
+				     unsigned long val, void *data)
+{
+	struct module *mod = data;
+	int ret = 0;
+
+	cpus_read_lock();
+	static_call_lock();
+
+	switch (val) {
+	case MODULE_STATE_COMING:
+		module_disable_ro(mod);
+		ret = static_call_add_module(mod);
+		module_enable_ro(mod, false);
+		if (ret) {
+			WARN(1, "Failed to allocate memory for static calls");
+			static_call_del_module(mod);
+		}
+		break;
+	case MODULE_STATE_GOING:
+		static_call_del_module(mod);
+		break;
+	}
+
+	static_call_unlock();
+	cpus_read_unlock();
+
+	return notifier_from_errno(ret);
+}
+
+static struct notifier_block static_call_module_nb = {
+	.notifier_call = static_call_module_notify,
+};
+
+#endif /* CONFIG_MODULES */
+
+static void __init static_call_init(void)
+{
+	struct static_call_site *start = __start_static_call_sites;
+	struct static_call_site *stop  = __stop_static_call_sites;
+	struct static_call_site *site;
+
+	if (start == stop) {
+		pr_warn("WARNING: empty static call table\n");
+		return;
+	}
+
+	cpus_read_lock();
+	static_call_lock();
+
+	static_call_sort_entries(start, stop);
+
+	for (site = start; site < stop; site++) {
+		struct static_call_key *key = static_call_key(site);
+		void *site_addr = static_call_addr(site);
+
+		if (init_section_contains(site_addr, 1))
+			static_call_set_init(site);
+
+		if (list_empty(&key->site_mods)) {
+			struct static_call_mod *site_mod;
+
+			site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+			if (!site_mod) {
+				WARN(1, "Failed to allocate memory for static calls");
+				goto done;
+			}
+
+			site_mod->sites = site;
+			list_add_tail(&site_mod->list, &key->site_mods);
+		}
+
+		arch_static_call_transform(site_addr, key->tramp, key->func);
+	}
+
+	static_call_initialized = true;
+
+done:
+	static_call_unlock();
+	cpus_read_unlock();
+
+#ifdef CONFIG_MODULES
+	if (static_call_initialized)
+		register_module_notifier(&static_call_module_nb);
+#endif
+}
+early_initcall(static_call_init);
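
A note on the STATIC_CALL_INIT trick used above: because struct
static_call_key is word aligned, the low bit of the stored key offset is free
to act as an "init section" marker.  A minimal user-space sketch of the same
low-bit tagging (the kernel tags the self-relative offset rather than a raw
pointer, but the bit manipulation is identical; names are illustrative):

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define INIT_FLAG 1UL			/* mirrors STATIC_CALL_INIT */

  static long key_storage;		/* stands in for a word-aligned key */

  int main(void)
  {
  	uintptr_t raw = (uintptr_t)&key_storage;

  	assert(!(raw & INIT_FLAG));	/* alignment guarantees a free bit */

  	uintptr_t tagged = raw | INIT_FLAG;		   /* ~ static_call_set_init() */
  	int is_init = !!(tagged & INIT_FLAG);		   /* ~ static_call_is_init()  */
  	void *key = (void *)(tagged & ~INIT_FLAG);	   /* ~ static_call_key()      */

  	printf("is_init=%d key=%p orig=%p\n", is_init, key, (void *)&key_storage);
  	return 0;
  }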



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 12/15] x86/static_call: Add out-of-line static call implementation
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (10 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 11/15] static_call: Add inline " Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-07  6:13   ` Nadav Amit
  2019-06-05 13:08 ` [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64 Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

From: Josh Poimboeuf <jpoimboe@redhat.com>

Add the x86 out-of-line static call implementation.  For each key, a
permanent trampoline is created which is the destination for all static
calls for the given key.  The trampoline has a direct jump which gets
patched by static_call_update() when the destination function changes.
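
The patching itself amounts to rewriting the rel32 of that 5-byte jmp (opcode
0xE9).  A small user-space sketch of the displacement arithmetic, purely to
show the encoding; the kernel of course goes through text_poke_bp() rather
than writing into a local buffer:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define JMP_INSN_SIZE 5			/* 0xE9 + 4-byte rel32 */

  /* Build the 5 bytes for "jmp func", assuming the insn lives at 'insn'. */
  static void encode_jmp(uint8_t buf[JMP_INSN_SIZE], uint64_t insn, uint64_t func)
  {
  	int32_t rel = (int32_t)(func - (insn + JMP_INSN_SIZE));

  	buf[0] = 0xE9;
  	memcpy(&buf[1], &rel, sizeof(rel));	/* little-endian rel32 */
  }

  int main(void)
  {
  	uint8_t buf[JMP_INSN_SIZE];

  	/* Addresses are made up, purely to show the arithmetic. */
  	encode_jmp(buf, 0x1000, 0x2000);
  	printf("%02x %02x %02x %02x %02x\n",
  	       buf[0], buf[1], buf[2], buf[3], buf[4]);
  	return 0;
  }

With the made-up addresses above this prints "e9 fb 0f 00 00";
arch_static_call_transform() does the same computation before handing the
bytes to text_poke_bp().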

Cc: x86@kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Julia Cartwright <julia@ni.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/00b08f2194e80241decbf206624b6580b9b8855b.1543200841.git.jpoimboe@redhat.com
---
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/static_call.h |   28 +++++++++++++++++++++++++++
 arch/x86/kernel/Makefile           |    1 
 arch/x86/kernel/static_call.c      |   38 +++++++++++++++++++++++++++++++++++++
 4 files changed, 68 insertions(+)
 create mode 100644 arch/x86/include/asm/static_call.h
 create mode 100644 arch/x86/kernel/static_call.c

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -198,6 +198,7 @@ config X86
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
 	select HAVE_STACK_VALIDATION		if X86_64
+	select HAVE_STATIC_CALL
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
--- /dev/null
+++ b/arch/x86/include/asm/static_call.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+/*
+ * Manually construct a 5-byte direct JMP to prevent the assembler from
+ * optimizing it into a 2-byte JMP.
+ */
+#define __ARCH_STATIC_CALL_JMP_LABEL(key) ".L" __stringify(key ## _after_jmp)
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
+	".byte 0xe9						\n"	\
+	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
+	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
+
+/*
+ * This is a permanent trampoline which does a direct jump to the function.
+ * The direct jump gets patched by static_call_update().
+ */
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
+	asm(".pushsection .text, \"ax\"				\n"	\
+	    ".align 4						\n"	\
+	    ".globl " STATIC_CALL_TRAMP_STR(key) "		\n"	\
+	    ".type " STATIC_CALL_TRAMP_STR(key) ", @function	\n"	\
+	    STATIC_CALL_TRAMP_STR(key) ":			\n"	\
+	    __ARCH_STATIC_CALL_TRAMP_JMP(key, func) "           \n"	\
+	    ".popsection					\n")
+
+#endif /* _ASM_STATIC_CALL_H */
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -63,6 +63,7 @@ obj-y			+= tsc.o tsc_msr.o io_delay.o rt
 obj-y			+= pci-iommu_table.o
 obj-y			+= resource.o
 obj-y			+= irqflags.o
+obj-y			+= static_call.o
 
 obj-y				+= process.o
 obj-y				+= fpu/
--- /dev/null
+++ b/arch/x86/kernel/static_call.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <linux/bug.h>
+#include <asm/text-patching.h>
+#include <asm/nospec-branch.h>
+
+#define CALL_INSN_SIZE 5
+
+void arch_static_call_transform(void *site, void *tramp, void *func)
+{
+	unsigned char opcodes[CALL_INSN_SIZE];
+	unsigned char insn_opcode;
+	unsigned long insn;
+	s32 dest_relative;
+
+	mutex_lock(&text_mutex);
+
+	insn = (unsigned long)tramp;
+
+	insn_opcode = *(unsigned char *)insn;
+	if (insn_opcode != 0xE9) {
+		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
+			  insn_opcode, (void *)insn);
+		goto unlock;
+	}
+
+	dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
+
+	opcodes[0] = insn_opcode;
+	memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
+
+	text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE, NULL);
+
+unlock:
+	mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (11 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 12/15] x86/static_call: Add out-of-line static call implementation Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-07  5:50   ` Nadav Amit
  2019-06-10 18:33   ` Josh Poimboeuf
  2019-06-05 13:08 ` [PATCH 14/15] static_call: Simple self-test module Peter Zijlstra
  2019-06-05 13:08 ` [PATCH 15/15] tracepoints: Use static_call Peter Zijlstra
  14 siblings, 2 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

From: Josh Poimboeuf <jpoimboe@redhat.com>

Add the inline static call implementation for x86-64.  For each key, a
temporary trampoline is created, named ____static_call_tramp_<key>.  The
trampoline has an indirect jump to the destination function.

Objtool uses the trampoline naming convention to detect all the call
sites.  It then annotates those call sites in the .static_call_sites
section.

During boot (and module init), the call sites are patched to call
directly into the destination function.  The temporary trampoline is
then no longer used.
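
The objtool side leans entirely on the trampoline naming convention: any call
whose destination starts with the ____static_call_tramp_ prefix is a static
call site, and stripping the prefix yields the key symbol.  A minimal
user-space sketch of that name recovery (illustrative only):

  #include <stdio.h>
  #include <string.h>

  #define TRAMP_PREFIX "____static_call_tramp_"	/* STATIC_CALL_TRAMP_PREFIX_STR */

  /* Given a call destination symbol, return the key name it encodes, or NULL. */
  static const char *static_call_key_name(const char *sym)
  {
  	size_t plen = strlen(TRAMP_PREFIX);

  	if (strncmp(sym, TRAMP_PREFIX, plen))
  		return NULL;		/* not a static call trampoline */
  	return sym + plen;		/* e.g. "my_key" */
  }

  int main(void)
  {
  	printf("%s\n", static_call_key_name("____static_call_tramp_my_key"));
  	return 0;
  }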

Cc: x86@kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Julia Cartwright <julia@ni.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/62188c62f6dda49ca2e20629ee8e5a62a6c0b500.1543200841.git.jpoimboe@redhat.com
---
 arch/x86/Kconfig                                |    3 
 arch/x86/include/asm/static_call.h              |   28 ++++-
 arch/x86/kernel/asm-offsets.c                   |    6 +
 arch/x86/kernel/static_call.c                   |   12 +-
 include/linux/static_call.h                     |    2 
 tools/objtool/Makefile                          |    3 
 tools/objtool/check.c                           |  125 +++++++++++++++++++++++-
 tools/objtool/check.h                           |    2 
 tools/objtool/elf.h                             |    1 
 tools/objtool/include/linux/static_call_types.h |   19 +++
 tools/objtool/sync-check.sh                     |    1 
 11 files changed, 193 insertions(+), 9 deletions(-)
 create mode 100644 tools/objtool/include/linux/static_call_types.h

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -199,6 +199,7 @@ config X86
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_STATIC_CALL
+	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
@@ -213,6 +214,7 @@ config X86
 	select RTC_MC146818_LIB
 	select SPARSE_IRQ
 	select SRCU
+	select STACK_VALIDATION			if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
@@ -439,7 +441,6 @@ config GOLDFISH
 config RETPOLINE
 	bool "Avoid speculative indirect branches in kernel"
 	default y
-	select STACK_VALIDATION if HAVE_STACK_VALIDATION
 	help
 	  Compile kernel with the retpoline compiler options to guard against
 	  kernel-to-user data leaks by avoiding speculative indirect
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -2,6 +2,20 @@
 #ifndef _ASM_STATIC_CALL_H
 #define _ASM_STATIC_CALL_H
 
+#include <asm/asm-offsets.h>
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+/*
+ * This trampoline is only used during boot / module init, so it's safe to use
+ * the indirect branch without a retpoline.
+ */
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
+	ANNOTATE_RETPOLINE_SAFE						\
+	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
+
+#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
  * Manually construct a 5-byte direct JMP to prevent the assembler from
  * optimizing it into a 2-byte JMP.
@@ -12,9 +26,19 @@
 	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
 	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
 
+#endif /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
 /*
- * This is a permanent trampoline which does a direct jump to the function.
- * The direct jump gets patched by static_call_update().
+ * For CONFIG_HAVE_STATIC_CALL_INLINE, this is a temporary trampoline which
+ * uses the current value of the key->func pointer to do an indirect jump to
+ * the function.  This trampoline is only used during boot, before the call
+ * sites get patched by static_call_update().  The name of this trampoline has
+ * a magical aspect: objtool uses it to find static call sites so it can create
+ * the .static_call_sites section.
+ *
+ * For CONFIG_HAVE_STATIC_CALL, this is a permanent trampoline which
+ * does a direct jump to the function.  The direct jump gets patched by
+ * static_call_update().
  */
 #define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
 	asm(".pushsection .text, \"ax\"				\n"	\
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -12,6 +12,7 @@
 #include <linux/hardirq.h>
 #include <linux/suspend.h>
 #include <linux/kbuild.h>
+#include <linux/static_call.h>
 #include <asm/processor.h>
 #include <asm/thread_info.h>
 #include <asm/sigframe.h>
@@ -104,4 +105,9 @@ static void __used common(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+	BLANK();
+	OFFSET(SC_KEY_func, static_call_key, func);
+#endif
 }
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -10,16 +10,22 @@
 void arch_static_call_transform(void *site, void *tramp, void *func)
 {
 	unsigned char opcodes[CALL_INSN_SIZE];
-	unsigned char insn_opcode;
+	unsigned char insn_opcode, expected;
 	unsigned long insn;
 	s32 dest_relative;
 
 	mutex_lock(&text_mutex);
 
-	insn = (unsigned long)tramp;
+	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) {
+		insn = (unsigned long)site;
+		expected = 0xE8; /* CALL */
+	} else {
+		insn = (unsigned long)tramp;
+		expected = 0xE9; /* JMP */
+	}
 
 	insn_opcode = *(unsigned char *)insn;
-	if (insn_opcode != 0xE9) {
+	if (insn_opcode != expected) {
 		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
 			  insn_opcode, (void *)insn);
 		goto unlock;
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -59,7 +59,7 @@
 #include <linux/cpu.h>
 #include <linux/static_call_types.h>
 
-#ifdef CONFIG_HAVE_STATIC_CALL
+#if defined(CONFIG_HAVE_STATIC_CALL) && !defined(COMPILE_OFFSETS)
 #include <asm/static_call.h>
 extern void arch_static_call_transform(void *site, void *tramp, void *func);
 #endif
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -33,7 +33,8 @@ all: $(OBJTOOL)
 
 INCLUDES := -I$(srctree)/tools/include \
 	    -I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
-	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include
+	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include \
+	    -I$(srctree)/tools/objtool/include
 WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
 CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES) $(LIBELF_FLAGS)
 LDFLAGS  += $(LIBELF_LIBS) $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -15,6 +15,7 @@
 
 #include <linux/hashtable.h>
 #include <linux/kernel.h>
+#include <linux/static_call_types.h>
 
 #define FAKE_JUMP_OFFSET -1
 
@@ -584,6 +585,10 @@ static int add_jump_destinations(struct
 			/* sibling call */
 			insn->call_dest = rela->sym;
 			insn->jump_dest = NULL;
+			if (rela->sym->static_call_tramp) {
+				list_add_tail(&insn->static_call_node,
+					      &file->static_call_list);
+			}
 			continue;
 		}
 
@@ -1271,6 +1276,24 @@ static int read_retpoline_hints(struct o
 	return 0;
 }
 
+static int read_static_call_tramps(struct objtool_file *file)
+{
+	struct section *sec;
+	struct symbol *func;
+
+	for_each_sec(file, sec) {
+		list_for_each_entry(func, &sec->symbol_list, list) {
+			if (func->bind == STB_GLOBAL &&
+			    !strncmp(func->name, STATIC_CALL_TRAMP_PREFIX_STR,
+				     strlen(STATIC_CALL_TRAMP_PREFIX_STR)))
+				func->static_call_tramp = true;
+		}
+
+	}
+
+	return 0;
+}
+
 static void mark_rodata(struct objtool_file *file)
 {
 	struct section *sec;
@@ -1337,6 +1360,10 @@ static int decode_sections(struct objtoo
 	if (ret)
 		return ret;
 
+	ret = read_static_call_tramps(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -2071,6 +2098,11 @@ static int validate_branch(struct objtoo
 				if (is_fentry_call(insn))
 					break;
 
+				if (insn->call_dest->static_call_tramp) {
+					list_add_tail(&insn->static_call_node,
+						      &file->static_call_list);
+				}
+
 				ret = dead_end_function(file, insn->call_dest);
 				if (ret == 1)
 					return 0;
@@ -2382,6 +2414,89 @@ static int validate_reachable_instructio
 	return 0;
 }
 
+static int create_static_call_sections(struct objtool_file *file)
+{
+	struct section *sec, *rela_sec;
+	struct rela *rela;
+	struct static_call_site *site;
+	struct instruction *insn;
+	char *key_name;
+	struct symbol *key_sym;
+	int idx;
+
+	sec = find_section_by_name(file->elf, ".static_call_sites");
+	if (sec) {
+		WARN("file already has .static_call_sites section, skipping");
+		return 0;
+	}
+
+	if (list_empty(&file->static_call_list))
+		return 0;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->static_call_list, static_call_node)
+		idx++;
+
+	sec = elf_create_section(file->elf, ".static_call_sites",
+				 sizeof(struct static_call_site), idx);
+	if (!sec)
+		return -1;
+
+	rela_sec = elf_create_rela_section(file->elf, sec);
+	if (!rela_sec)
+		return -1;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->static_call_list, static_call_node) {
+
+		site = (struct static_call_site *)sec->data->d_buf + idx;
+		memset(site, 0, sizeof(struct static_call_site));
+
+		/* populate rela for 'addr' */
+		rela = malloc(sizeof(*rela));
+		if (!rela) {
+			perror("malloc");
+			return -1;
+		}
+		memset(rela, 0, sizeof(*rela));
+		rela->sym = insn->sec->sym;
+		rela->addend = insn->offset;
+		rela->type = R_X86_64_PC32;
+		rela->offset = idx * sizeof(struct static_call_site);
+		list_add_tail(&rela->list, &rela_sec->rela_list);
+		hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+		/* find key symbol */
+		key_name = insn->call_dest->name + strlen(STATIC_CALL_TRAMP_PREFIX_STR);
+		key_sym = find_symbol_by_name(file->elf, key_name);
+		if (!key_sym) {
+			WARN("can't find static call key symbol: %s", key_name);
+			return -1;
+		}
+
+		/* populate rela for 'key' */
+		rela = malloc(sizeof(*rela));
+		if (!rela) {
+			perror("malloc");
+			return -1;
+		}
+		memset(rela, 0, sizeof(*rela));
+		rela->sym = key_sym;
+		rela->addend = 0;
+		rela->type = R_X86_64_PC32;
+		rela->offset = idx * sizeof(struct static_call_site) + 4;
+		list_add_tail(&rela->list, &rela_sec->rela_list);
+		hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+		idx++;
+	}
+
+	if (elf_rebuild_rela_section(rela_sec))
+		return -1;
+
+	return 0;
+}
+
 static void cleanup(struct objtool_file *file)
 {
 	struct instruction *insn, *tmpinsn;
@@ -2407,12 +2522,13 @@ int check(const char *_objname, bool orc
 
 	objname = _objname;
 
-	file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
+	file.elf = elf_open(objname, O_RDWR);
 	if (!file.elf)
 		return 1;
 
 	INIT_LIST_HEAD(&file.insn_list);
 	hash_init(file.insn_hash);
+	INIT_LIST_HEAD(&file.static_call_list);
 	file.c_file = find_section_by_name(file.elf, ".comment");
 	file.ignore_unreachables = no_unreachable;
 	file.hints = false;
@@ -2451,6 +2567,11 @@ int check(const char *_objname, bool orc
 		warnings += ret;
 	}
 
+	ret = create_static_call_sections(&file);
+	if (ret < 0)
+		goto out;
+	warnings += ret;
+
 	if (orc) {
 		ret = create_orc(&file);
 		if (ret < 0)
@@ -2459,7 +2580,9 @@ int check(const char *_objname, bool orc
 		ret = create_orc_sections(&file);
 		if (ret < 0)
 			goto out;
+	}
 
+	if (orc || !list_empty(&file.static_call_list)) {
 		ret = elf_write(file.elf);
 		if (ret < 0)
 			goto out;
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -28,6 +28,7 @@ struct insn_state {
 struct instruction {
 	struct list_head list;
 	struct hlist_node hash;
+	struct list_head static_call_node;
 	struct section *sec;
 	unsigned long offset;
 	unsigned int len;
@@ -49,6 +50,7 @@ struct objtool_file {
 	struct elf *elf;
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
+	struct list_head static_call_list;
 	bool ignore_unreachables, c_file, hints, rodata;
 };
 
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -51,6 +51,7 @@ struct symbol {
 	unsigned int len;
 	struct symbol *pfunc, *cfunc, *alias;
 	bool uaccess_safe;
+	bool static_call_tramp;
 };
 
 struct rela {
--- /dev/null
+++ b/tools/objtool/include/linux/static_call_types.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/* The static call site table is created by objtool. */
+struct static_call_site {
+	s32 addr;
+	s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
--- a/tools/objtool/sync-check.sh
+++ b/tools/objtool/sync-check.sh
@@ -10,6 +10,7 @@ arch/x86/include/asm/insn.h
 arch/x86/include/asm/inat.h
 arch/x86/include/asm/inat_types.h
 arch/x86/include/asm/orc_types.h
+include/linux/static_call_types.h
 '
 
 check()



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 14/15] static_call: Simple self-test module
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (12 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64 Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  2019-06-10 17:24   ` Josh Poimboeuf
  2019-06-05 13:08 ` [PATCH 15/15] tracepoints: Use static_call Peter Zijlstra
  14 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 lib/Kconfig.debug      |    8 ++++++++
 lib/Makefile           |    1 +
 lib/test_static_call.c |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1955,6 +1955,14 @@ config TEST_STATIC_KEYS
 
 	  If unsure, say N.
 
+config TEST_STATIC_CALL
+	tristate "Test static call"
+	depends on m
+	help
+	  Test the static call interfaces.
+
+	  If unsure, say N.
+
 config TEST_KMOD
 	tristate "kmod stress tester"
 	depends on m
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_TEST_SORT) += test_sort.o
 obj-$(CONFIG_TEST_USER_COPY) += test_user_copy.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
+obj-$(CONFIG_TEST_STATIC_CALL) += test_static_call.o
 obj-$(CONFIG_TEST_PRINTF) += test_printf.o
 obj-$(CONFIG_TEST_BITMAP) += test_bitmap.o
 obj-$(CONFIG_TEST_STRSCPY) += test_strscpy.o
--- /dev/null
+++ b/lib/test_static_call.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/module.h>
+#include <linux/static_call.h>
+#include <asm/bug.h>
+
+static int foo_a(int x)
+{
+	return x+1;
+}
+
+static int foo_b(int x)
+{
+	return x*2;
+}
+
+DEFINE_STATIC_CALL(foo, foo_a);
+
+static int __init test_static_call_init(void)
+{
+	WARN_ON(static_call(foo, 2) != 3);
+
+	static_call_update(foo, foo_b);
+
+	WARN_ON(static_call(foo, 2) != 4);
+
+	static_call_update(foo, foo_a);
+
+	WARN_ON(static_call(foo, 2) != 3);
+
+	return 0;
+}
+module_init(test_static_call_init);
+
+static void __exit test_static_call_exit(void)
+{
+}
+module_exit(test_static_call_exit);
+
+MODULE_AUTHOR("Peter Zijlstra <peterz@infradead.org>");
+MODULE_LICENSE("GPL");



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 15/15] tracepoints: Use static_call
  2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
                   ` (13 preceding siblings ...)
  2019-06-05 13:08 ` [PATCH 14/15] static_call: Simple self-test module Peter Zijlstra
@ 2019-06-05 13:08 ` Peter Zijlstra
  14 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-05 13:08 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

From: Steven Rostedt (VMware) <rostedt@goodmis.org>

... Changelog goes here ...

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/tracepoint-defs.h |    4 ++
 include/linux/tracepoint.h      |   73 ++++++++++++++++++++++++++--------------
 include/trace/define_trace.h    |   14 +++----
 kernel/tracepoint.c             |   25 +++++++++++--
 4 files changed, 80 insertions(+), 36 deletions(-)

--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -11,6 +11,8 @@
 #include <linux/atomic.h>
 #include <linux/static_key.h>
 
+struct static_call_key;
+
 struct trace_print_flags {
 	unsigned long		mask;
 	const char		*name;
@@ -30,6 +32,8 @@ struct tracepoint_func {
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
+	struct static_call_key *static_call_key;
+	void *iterator;
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -21,6 +21,7 @@
 #include <linux/cpumask.h>
 #include <linux/rcupdate.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/static_call.h>
 
 struct module;
 struct tracepoint;
@@ -94,7 +95,9 @@ extern int syscall_regfunc(void);
 extern void syscall_unregfunc(void);
 #endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
 
+#ifndef PARAMS
 #define PARAMS(args...) args
+#endif
 
 #define TRACE_DEFINE_ENUM(x)
 #define TRACE_DEFINE_SIZEOF(x)
@@ -161,12 +164,11 @@ static inline struct tracepoint *tracepo
  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
+#define __DO_TRACE(name, proto, args, cond, rcuidle)			\
 	do {								\
 		struct tracepoint_func *it_func_ptr;			\
-		void *it_func;						\
-		void *__data;						\
 		int __maybe_unused __idx = 0;				\
+		void *__data;						\
 									\
 		if (!(cond))						\
 			return;						\
@@ -186,14 +188,11 @@ static inline struct tracepoint *tracepo
 			rcu_irq_enter_irqson();				\
 		}							\
 									\
-		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
-									\
+		it_func_ptr =						\
+			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
 		if (it_func_ptr) {					\
-			do {						\
-				it_func = (it_func_ptr)->func;		\
-				__data = (it_func_ptr)->data;		\
-				((void(*)(proto))(it_func))(args);	\
-			} while ((++it_func_ptr)->func);		\
+			__data = (it_func_ptr)->data;			\
+			static_call(tp_func_##name, args);		\
 		}							\
 									\
 		if (rcuidle) {						\
@@ -209,7 +208,7 @@ static inline struct tracepoint *tracepo
 	static inline void trace_##name##_rcuidle(proto)		\
 	{								\
 		if (static_key_false(&__tracepoint_##name.key))		\
-			__DO_TRACE(&__tracepoint_##name,		\
+			__DO_TRACE(name,				\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
 				TP_CONDITION(cond), 1);			\
@@ -231,11 +230,13 @@ static inline struct tracepoint *tracepo
  * poking RCU a bit.
  */
 #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+	extern int __tracepoint_iter_##name(data_proto);		\
+	DECLARE_STATIC_CALL(tp_func_##name, __tracepoint_iter_##name); \
 	extern struct tracepoint __tracepoint_##name;			\
 	static inline void trace_##name(proto)				\
 	{								\
 		if (static_key_false(&__tracepoint_##name.key))		\
-			__DO_TRACE(&__tracepoint_##name,		\
+			__DO_TRACE(name,				\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
 				TP_CONDITION(cond), 0);			\
@@ -281,21 +282,43 @@ static inline struct tracepoint *tracepo
  * structures, so we create an array of pointers that will be used for iteration
  * on the tracepoints.
  */
-#define DEFINE_TRACE_FN(name, reg, unreg)				 \
-	static const char __tpstrtab_##name[]				 \
-	__attribute__((section("__tracepoints_strings"))) = #name;	 \
-	struct tracepoint __tracepoint_##name				 \
-	__attribute__((section("__tracepoints"), used)) =		 \
-		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
-	__TRACEPOINT_ENTRY(name);
+#define DEFINE_TRACE_FN(name, reg, unreg, proto, args)			\
+	static const char __tpstrtab_##name[]				\
+	__attribute__((section("__tracepoints_strings"))) = #name;	\
+	extern struct static_call_key tp_func_##name;			\
+	int __tracepoint_iter_##name(void *__data, proto);		\
+	struct tracepoint __tracepoint_##name				\
+	__attribute__((section("__tracepoints"), used)) =		\
+		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE,		\
+		  &tp_func_##name, __tracepoint_iter_##name,		\
+		  reg, unreg, NULL };					\
+	__TRACEPOINT_ENTRY(name);					\
+	int __tracepoint_iter_##name(void *__data, proto)		\
+	{								\
+		struct tracepoint_func *it_func_ptr;			\
+		void *it_func;						\
+									\
+		it_func_ptr =						\
+			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
+		do {							\
+			it_func = (it_func_ptr)->func;			\
+			__data = (it_func_ptr)->data;			\
+			((void(*)(void *, proto))(it_func))(__data, args); \
+		} while ((++it_func_ptr)->func);			\
+		return 0;						\
+	}								\
+	DEFINE_STATIC_CALL(tp_func_##name, __tracepoint_iter_##name);
 
-#define DEFINE_TRACE(name)						\
-	DEFINE_TRACE_FN(name, NULL, NULL);
+#define DEFINE_TRACE(name, proto, args)		\
+	DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
 
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)				\
-	EXPORT_SYMBOL_GPL(__tracepoint_##name)
+	EXPORT_SYMBOL_GPL(__tracepoint_##name);				\
+	EXPORT_STATIC_CALL_GPL(tp_func_##name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)					\
-	EXPORT_SYMBOL(__tracepoint_##name)
+	EXPORT_SYMBOL(__tracepoint_##name);				\
+	EXPORT_STATIC_CALL(tp_func_##name)
+
 
 #else /* !TRACEPOINTS_ENABLED */
 #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
@@ -324,8 +347,8 @@ static inline struct tracepoint *tracepo
 		return false;						\
 	}
 
-#define DEFINE_TRACE_FN(name, reg, unreg)
-#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
+#define DEFINE_TRACE(name, proto, args)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)
 
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -25,7 +25,7 @@
 
 #undef TRACE_EVENT
 #define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
 
 #undef TRACE_EVENT_CONDITION
 #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
@@ -39,12 +39,12 @@
 #undef TRACE_EVENT_FN
 #define TRACE_EVENT_FN(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)			\
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,		\
 		assign, print, reg, unreg)			\
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
 #undef TRACE_EVENT_NOP
 #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
@@ -54,15 +54,15 @@
 
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args) \
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
 
 #undef DEFINE_EVENT_FN
 #define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
 
 #undef DEFINE_EVENT_CONDITION
 #define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
@@ -70,7 +70,7 @@
 
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(name, proto, args)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
 
 #undef TRACE_INCLUDE
 #undef __TRACE_INCLUDE
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -127,7 +127,7 @@ static void debug_print_probes(struct tr
 
 static struct tracepoint_func *
 func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
-	 int prio)
+	 int prio, int *tot_probes)
 {
 	struct tracepoint_func *old, *new;
 	int nr_probes = 0;
@@ -170,11 +170,12 @@ func_add(struct tracepoint_func **funcs,
 	new[nr_probes + 1].func = NULL;
 	*funcs = new;
 	debug_print_probes(*funcs);
+	*tot_probes = nr_probes + 1;
 	return old;
 }
 
 static void *func_remove(struct tracepoint_func **funcs,
-		struct tracepoint_func *tp_func)
+		struct tracepoint_func *tp_func, int *left)
 {
 	int nr_probes = 0, nr_del = 0, i;
 	struct tracepoint_func *old, *new;
@@ -228,6 +229,7 @@ static int tracepoint_add_func(struct tr
 			       struct tracepoint_func *func, int prio)
 {
 	struct tracepoint_func *old, *tp_funcs;
+	int probes = 0;
 	int ret;
 
 	if (tp->regfunc && !static_key_enabled(&tp->key)) {
@@ -238,7 +240,7 @@ static int tracepoint_add_func(struct tr
 
 	tp_funcs = rcu_dereference_protected(tp->funcs,
 			lockdep_is_held(&tracepoints_mutex));
-	old = func_add(&tp_funcs, func, prio);
+	old = func_add(&tp_funcs, func, prio, &probes);
 	if (IS_ERR(old)) {
 		WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
 		return PTR_ERR(old);
@@ -253,6 +255,13 @@ static int tracepoint_add_func(struct tr
 	rcu_assign_pointer(tp->funcs, tp_funcs);
 	if (!static_key_enabled(&tp->key))
 		static_key_slow_inc(&tp->key);
+
+	if (probes == 1) {
+		__static_call_update(tp->static_call_key, tp_funcs->func);
+	} else {
+		__static_call_update(tp->static_call_key, tp->iterator);
+	}
+
 	release_probes(old);
 	return 0;
 }
@@ -267,10 +276,11 @@ static int tracepoint_remove_func(struct
 		struct tracepoint_func *func)
 {
 	struct tracepoint_func *old, *tp_funcs;
+	int probes_left = 0;
 
 	tp_funcs = rcu_dereference_protected(tp->funcs,
 			lockdep_is_held(&tracepoints_mutex));
-	old = func_remove(&tp_funcs, func);
+	old = func_remove(&tp_funcs, func, &probes_left);
 	if (IS_ERR(old)) {
 		WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
 		return PTR_ERR(old);
@@ -284,6 +294,13 @@ static int tracepoint_remove_func(struct
 		if (static_key_enabled(&tp->key))
 			static_key_slow_dec(&tp->key);
 	}
+
+	if (probes_left == 1) {
+		__static_call_update(tp->static_call_key, tp_funcs->func);
+	} else {
+		__static_call_update(tp->static_call_key, tp->iterator);
+	}
+
 	rcu_assign_pointer(tp->funcs, tp_funcs);
 	release_probes(old);
 	return 0;



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-05 13:08 ` [PATCH 11/15] static_call: Add inline " Peter Zijlstra
@ 2019-06-06 22:24   ` Nadav Amit
  2019-06-07  8:37     ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-06 22:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> Add infrastructure for an arch-specific CONFIG_HAVE_STATIC_CALL_INLINE
> option, which is a faster version of CONFIG_HAVE_STATIC_CALL.  At
> runtime, the static call sites are patched directly, rather than using
> the out-of-line trampolines.
> 
> Compared to out-of-line static calls, the performance benefits are more
> modest, but still measurable.  Steven Rostedt did some tracepoint
> measurements:

[ snip ]

> +static void static_call_del_module(struct module *mod)
> +{
> +	struct static_call_site *start = mod->static_call_sites;
> +	struct static_call_site *stop = mod->static_call_sites +
> +					mod->num_static_call_sites;
> +	struct static_call_site *site;
> +	struct static_call_key *key, *prev_key = NULL;
> +	struct static_call_mod *site_mod;
> +
> +	for (site = start; site < stop; site++) {
> +		key = static_call_key(site);
> +		if (key == prev_key)
> +			continue;
> +		prev_key = key;
> +
> +		list_for_each_entry(site_mod, &key->site_mods, list) {
> +			if (site_mod->mod == mod) {
> +				list_del(&site_mod->list);
> +				kfree(site_mod);
> +				break;
> +			}
> +		}
> +	}

I think that for safety, when a module is removed, all the static calls
should be traversed to check that none of them calls any function in the
removed module. If one does, perhaps that call site should be poisoned.

> +}
> +
> +static int static_call_module_notify(struct notifier_block *nb,
> +				     unsigned long val, void *data)
> +{
> +	struct module *mod = data;
> +	int ret = 0;
> +
> +	cpus_read_lock();
> +	static_call_lock();
> +
> +	switch (val) {
> +	case MODULE_STATE_COMING:
> +		module_disable_ro(mod);
> +		ret = static_call_add_module(mod);
> +		module_enable_ro(mod, false);

Doesn’t it cause some pages to be W+X? Can it be avoided?

> +		if (ret) {
> +			WARN(1, "Failed to allocate memory for static calls");
> +			static_call_del_module(mod);

If static_call_add_module() succeeded in changing some of the calls, but not
all, I don’t think that static_call_del_module() will correctly undo
static_call_add_module(). The code transformations, I think, will remain.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-05 13:08 ` [PATCH 10/15] static_call: Add basic static call infrastructure Peter Zijlstra
@ 2019-06-06 22:44   ` Nadav Amit
  2019-06-07  8:28     ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-06 22:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> Static calls are a replacement for global function pointers.  They use
> code patching to allow direct calls to be used instead of indirect
> calls.  They give the flexibility of function pointers, but with
> improved performance.  This is especially important for cases where
> retpolines would otherwise be used, as retpolines can significantly
> impact performance.
> 
> The concept and code are an extension of previous work done by Ard
> Biesheuvel and Steven Rostedt:
> 
>  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheuvel@linaro.org
>  https://lkml.kernel.org/r/20181006015110.653946300@goodmis.org
> 
> There are two implementations, depending on arch support:
> 
> 1) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
> 2) basic function pointers
> 
> For more details, see the comments in include/linux/static_call.h.
> 
> Cc: x86@kernel.org
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Julia Cartwright <julia@ni.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Jason Baron <jbaron@akamai.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Jiri Kosina <jkosina@suse.cz>
> Cc: Edward Cree <ecree@solarflare.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Jessica Yu <jeyu@kernel.org>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/a01f733889ebf4bc447507ab8041a60378eaa89f.1547073843.git.jpoimboe@redhat.com
> ---
> arch/Kconfig                      |    3 
> include/linux/static_call.h       |  135 ++++++++++++++++++++++++++++++++++++++
> include/linux/static_call_types.h |   13 +++
> 3 files changed, 151 insertions(+)
> create mode 100644 include/linux/static_call.h
> create mode 100644 include/linux/static_call_types.h
> 
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -927,6 +927,9 @@ config LOCK_EVENT_COUNTS
> 	  the chance of application behavior change because of timing
> 	  differences. The counts are reported via debugfs.
> 
> +config HAVE_STATIC_CALL
> +	bool
> +
> source "kernel/gcov/Kconfig"
> 
> source "scripts/gcc-plugins/Kconfig"
> --- /dev/null
> +++ b/include/linux/static_call.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STATIC_CALL_H
> +#define _LINUX_STATIC_CALL_H
> +
> +/*
> + * Static call support
> + *
> + * Static calls use code patching to hard-code function pointers into direct
> + * branch instructions.  They give the flexibility of function pointers, but
> + * with improved performance.  This is especially important for cases where
> + * retpolines would otherwise be used, as retpolines can significantly impact
> + * performance.
> + *
> + *
> + * API overview:
> + *
> + *   DECLARE_STATIC_CALL(key, func);
> + *   DEFINE_STATIC_CALL(key, func);
> + *   static_call(key, args...);
> + *   static_call_update(key, func);
> + *
> + *
> + * Usage example:
> + *
> + *   # Start with the following functions (with identical prototypes):
> + *   int func_a(int arg1, int arg2);
> + *   int func_b(int arg1, int arg2);
> + *
> + *   # Define a 'my_key' reference, associated with func_a() by default
> + *   DEFINE_STATIC_CALL(my_key, func_a);
> + *
> + *   # Call func_a()
> + *   static_call(my_key, arg1, arg2);
> + *
> + *   # Update 'my_key' to point to func_b()
> + *   static_call_update(my_key, func_b);
> + *
> + *   # Call func_b()
> + *   static_call(my_key, arg1, arg2);

I think that this calling interface is not very intuitive. I understand that
the macros/objtool cannot allow the calling interface to be completely
transparent (as a compiler plugin could). But can the macros be used to
paste the key with the “static_call”? I think that having something like:

  static_call__func(arg1, arg2)

is more readable than

  static_call(func, arg1, arg2)
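
To illustrate with the usage example from the patch (hand-written sugar
only, nothing the macros generate today):

/* 'my_key' and the int (int, int) prototype are from the example above. */
static inline int static_call__my_key(int arg1, int arg2)
{
	return static_call(my_key, arg1, arg2);
}

so the call site reads static_call__my_key(arg1, arg2); generating such a
wrapper automatically is what I am asking about.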

> +}
> +
> +#define static_call_update(key, func)					\
> +({									\
> +	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
> +	__static_call_update(&key, func);				\
> +})

Is this safe against concurrent module removal?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
@ 2019-06-07  5:41   ` Nadav Amit
  2019-06-07  8:20     ` Peter Zijlstra
  2019-06-07 15:47   ` Masami Hiramatsu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-07  5:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
> 
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
> 
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
> 
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.
> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Nadav Amit <namit@vmware.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> arch/x86/include/asm/text-patching.h |    2 -
> arch/x86/kernel/alternative.c        |   47 ++++++++++++++++++++++++++---------
> arch/x86/kernel/jump_label.c         |    3 --
> arch/x86/kernel/kprobes/opt.c        |   11 +++++---
> 4 files changed, 46 insertions(+), 17 deletions(-)
> 
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> -extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
> extern int after_bootmem;
> extern __ro_after_init struct mm_struct *poking_mm;
> extern __ro_after_init unsigned long poking_addr;
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -921,19 +921,25 @@ static void do_sync_core(void *info)
> }
> 
> static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static const void *bp_int3_opcode, *bp_int3_addr;
> 
> int poke_int3_handler(struct pt_regs *regs)
> {
> +	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> +	struct opcode {
> +		u8 insn;
> +		s32 rel;
> +	} __packed opcode;
> +
> 	/*
> 	 * Having observed our INT3 instruction, we now must observe
> 	 * bp_patching_in_progress.
> 	 *
> -	 * 	in_progress = TRUE		INT3
> -	 * 	WMB				RMB
> -	 * 	write INT3			if (in_progress)
> +	 *	in_progress = TRUE		INT3
> +	 *	WMB				RMB
> +	 *	write INT3			if (in_progress)

I don’t see what has changed in this chunk… Whitespaces?

> 	 *
> -	 * Idem for bp_int3_handler.
> +	 * Idem for bp_int3_opcode.
> 	 */
> 	smp_rmb();
> 
> @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> 		return 0;
> 
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	opcode = *(struct opcode *)bp_int3_opcode;
> +
> +	switch (opcode.insn) {
> +	case 0xE8: /* CALL */
> +		int3_emulate_call(regs, ip + opcode.rel);
> +		break;
> +
> +	case 0xE9: /* JMP */
> +		int3_emulate_jmp(regs, ip + opcode.rel);
> +		break;

Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
constants (0xE8, 0xE9), just as you do later in the patch.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-05 13:08 ` [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64 Peter Zijlstra
@ 2019-06-07  5:50   ` Nadav Amit
  2019-06-10 18:33   ` Josh Poimboeuf
  1 sibling, 0 replies; 87+ messages in thread
From: Nadav Amit @ 2019-06-07  5:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> Add the inline static call implementation for x86-64.  For each key, a
> temporary trampoline is created, named __static_call_tramp_<key>.  The
> trampoline has an indirect jump to the destination function.
> 
> Objtool uses the trampoline naming convention to detect all the call
> sites.  It then annotates those call sites in the .static_call_sites
> section.
> 
> During boot (and module init), the call sites are patched to call
> directly into the destination function.  The temporary trampoline is
> then no longer used.
> 
> Cc: x86@kernel.org
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Julia Cartwright <julia@ni.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Jason Baron <jbaron@akamai.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Jiri Kosina <jkosina@suse.cz>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Jessica Yu <jeyu@kernel.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/62188c62f6dda49ca2e20629ee8e5a62a6c0b500.1543200841.git.jpoimboe@redhat.com
> ---
> arch/x86/Kconfig                                |    3 
> arch/x86/include/asm/static_call.h              |   28 ++++-
> arch/x86/kernel/asm-offsets.c                   |    6 +
> arch/x86/kernel/static_call.c                   |   12 +-
> include/linux/static_call.h                     |    2 
> tools/objtool/Makefile                          |    3 
> tools/objtool/check.c                           |  125 +++++++++++++++++++++++-
> tools/objtool/check.h                           |    2 
> tools/objtool/elf.h                             |    1 
> tools/objtool/include/linux/static_call_types.h |   19 +++
> tools/objtool/sync-check.sh                     |    1 
> 11 files changed, 193 insertions(+), 9 deletions(-)
> create mode 100644 tools/objtool/include/linux/static_call_types.h
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -199,6 +199,7 @@ config X86
> 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
> 	select HAVE_STACK_VALIDATION		if X86_64
> 	select HAVE_STATIC_CALL
> +	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
> 	select HAVE_RSEQ
> 	select HAVE_SYSCALL_TRACEPOINTS
> 	select HAVE_UNSTABLE_SCHED_CLOCK
> @@ -213,6 +214,7 @@ config X86
> 	select RTC_MC146818_LIB
> 	select SPARSE_IRQ
> 	select SRCU
> +	select STACK_VALIDATION			if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
> 	select SYSCTL_EXCEPTION_TRACE
> 	select THREAD_INFO_IN_TASK
> 	select USER_STACKTRACE_SUPPORT
> @@ -439,7 +441,6 @@ config GOLDFISH
> config RETPOLINE
> 	bool "Avoid speculative indirect branches in kernel"
> 	default y
> -	select STACK_VALIDATION if HAVE_STACK_VALIDATION
> 	help
> 	  Compile kernel with the retpoline compiler options to guard against
> 	  kernel-to-user data leaks by avoiding speculative indirect
> --- a/arch/x86/include/asm/static_call.h
> +++ b/arch/x86/include/asm/static_call.h
> @@ -2,6 +2,20 @@
> #ifndef _ASM_STATIC_CALL_H
> #define _ASM_STATIC_CALL_H
> 
> +#include <asm/asm-offsets.h>
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +
> +/*
> + * This trampoline is only used during boot / module init, so it's safe to use
> + * the indirect branch without a retpoline.
> + */
> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> +	ANNOTATE_RETPOLINE_SAFE						\
> +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
> +
> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> +
> /*
>  * Manually construct a 5-byte direct JMP to prevent the assembler from
>  * optimizing it into a 2-byte JMP.
> @@ -12,9 +26,19 @@
> 	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
> 	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
> 
> +#endif /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> +
> /*
> - * This is a permanent trampoline which does a direct jump to the function.
> - * The direct jump get patched by static_call_update().
> + * For CONFIG_HAVE_STATIC_CALL_INLINE, this is a temporary trampoline which
> + * uses the current value of the key->func pointer to do an indirect jump to
> + * the function.  This trampoline is only used during boot, before the call
> + * sites get patched by static_call_update().  The name of this trampoline has
> + * a magical aspect: objtool uses it to find static call sites so it can create
> + * the .static_call_sites section.
> + *
> + * For CONFIG_HAVE_STATIC_CALL, this is a permanent trampoline which
> + * does a direct jump to the function.  The direct jump gets patched by
> + * static_call_update().
>  */
> #define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
> 	asm(".pushsection .text, \"ax\"				\n"	\
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -12,6 +12,7 @@
> #include <linux/hardirq.h>
> #include <linux/suspend.h>
> #include <linux/kbuild.h>
> +#include <linux/static_call.h>
> #include <asm/processor.h>
> #include <asm/thread_info.h>
> #include <asm/sigframe.h>
> @@ -104,4 +105,9 @@ static void __used common(void)
> 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
> 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
> 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +	BLANK();
> +	OFFSET(SC_KEY_func, static_call_key, func);
> +#endif
> }
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -10,16 +10,22 @@
> void arch_static_call_transform(void *site, void *tramp, void *func)
> {
> 	unsigned char opcodes[CALL_INSN_SIZE];
> -	unsigned char insn_opcode;
> +	unsigned char insn_opcode, expected;
> 	unsigned long insn;
> 	s32 dest_relative;
> 
> 	mutex_lock(&text_mutex);
> 
> -	insn = (unsigned long)tramp;
> +	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) {
> +		insn = (unsigned long)site;
> +		expected = 0xE8; /* CALL */

RELATIVECALL_OPCODE ?

> +	} else {
> +		insn = (unsigned long)tramp;
> +		expected = 0xE9; /* JMP */

RELATIVEJUMP_OPCODE ?

( I did not review the objtool parts )

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 12/15] x86/static_call: Add out-of-line static call implementation
  2019-06-05 13:08 ` [PATCH 12/15] x86/static_call: Add out-of-line static call implementation Peter Zijlstra
@ 2019-06-07  6:13   ` Nadav Amit
  2019-06-07  7:51     ` Steven Rostedt
  2019-06-07  8:38     ` Peter Zijlstra
  0 siblings, 2 replies; 87+ messages in thread
From: Nadav Amit @ 2019-06-07  6:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> Add the x86 out-of-line static call implementation.  For each key, a
> permanent trampoline is created which is the destination for all static
> calls for the given key.  The trampoline has a direct jump which gets
> patched by static_call_update() when the destination function changes.
> 
> Cc: x86@kernel.org
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Julia Cartwright <julia@ni.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Jason Baron <jbaron@akamai.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Jiri Kosina <jkosina@suse.cz>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Jessica Yu <jeyu@kernel.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/00b08f2194e80241decbf206624b6580b9b8855b.1543200841.git.jpoimboe@redhat.com
> ---
> arch/x86/Kconfig                   |    1 
> arch/x86/include/asm/static_call.h |   28 +++++++++++++++++++++++++++
> arch/x86/kernel/Makefile           |    1 
> arch/x86/kernel/static_call.c      |   38 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 68 insertions(+)
> create mode 100644 arch/x86/include/asm/static_call.h
> create mode 100644 arch/x86/kernel/static_call.c
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -198,6 +198,7 @@ config X86
> 	select HAVE_FUNCTION_ARG_ACCESS_API
> 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
> 	select HAVE_STACK_VALIDATION		if X86_64
> +	select HAVE_STATIC_CALL
> 	select HAVE_RSEQ
> 	select HAVE_SYSCALL_TRACEPOINTS
> 	select HAVE_UNSTABLE_SCHED_CLOCK
> --- /dev/null
> +++ b/arch/x86/include/asm/static_call.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_STATIC_CALL_H
> +#define _ASM_STATIC_CALL_H
> +
> +/*
> + * Manually construct a 5-byte direct JMP to prevent the assembler from
> + * optimizing it into a 2-byte JMP.
> + */
> +#define __ARCH_STATIC_CALL_JMP_LABEL(key) ".L" __stringify(key ## _after_jmp)
> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> +	".byte 0xe9						\n"	\
> +	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
> +	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
> +
> +/*
> + * This is a permanent trampoline which does a direct jump to the function.
> + * The direct jump get patched by static_call_update().
> + */
> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
> +	asm(".pushsection .text, \"ax\"				\n"	\
> +	    ".align 4						\n"	\
> +	    ".globl " STATIC_CALL_TRAMP_STR(key) "		\n"	\
> +	    ".type " STATIC_CALL_TRAMP_STR(key) ", @function	\n"	\
> +	    STATIC_CALL_TRAMP_STR(key) ":			\n"	\
> +	    __ARCH_STATIC_CALL_TRAMP_JMP(key, func) "           \n"	\
> +	    ".popsection					\n")
> +
> +#endif /* _ASM_STATIC_CALL_H */
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -63,6 +63,7 @@ obj-y			+= tsc.o tsc_msr.o io_delay.o rt
> obj-y			+= pci-iommu_table.o
> obj-y			+= resource.o
> obj-y			+= irqflags.o
> +obj-y			+= static_call.o
> 
> obj-y				+= process.o
> obj-y				+= fpu/
> --- /dev/null
> +++ b/arch/x86/kernel/static_call.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/static_call.h>
> +#include <linux/memory.h>
> +#include <linux/bug.h>
> +#include <asm/text-patching.h>
> +#include <asm/nospec-branch.h>
> +
> +#define CALL_INSN_SIZE 5
> +
> +void arch_static_call_transform(void *site, void *tramp, void *func)
> +{
> +	unsigned char opcodes[CALL_INSN_SIZE];
> +	unsigned char insn_opcode;
> +	unsigned long insn;
> +	s32 dest_relative;
> +
> +	mutex_lock(&text_mutex);
> +
> +	insn = (unsigned long)tramp;
> +
> +	insn_opcode = *(unsigned char *)insn;
> +	if (insn_opcode != 0xE9) {
> +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> +			  insn_opcode, (void *)insn);
> +		goto unlock;

This might happen if a kprobe is installed on the call, no?

I don’t know if you want to handle this case more gently (or perhaps
modify can_probe() to prevent such a case).


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 12/15] x86/static_call: Add out-of-line static call implementation
  2019-06-07  6:13   ` Nadav Amit
@ 2019-06-07  7:51     ` Steven Rostedt
  2019-06-07  8:38     ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Steven Rostedt @ 2019-06-07  7:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, 7 Jun 2019 06:13:58 +0000
Nadav Amit <namit@vmware.com> wrote:

> > On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > 
> > Add the x86 out-of-line static call implementation.  For each key, a
> > permanent trampoline is created which is the destination for all static
> > calls for the given key.  The trampoline has a direct jump which gets
> > patched by static_call_update() when the destination function changes.
> > 
> > Cc: x86@kernel.org
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Julia Cartwright <julia@ni.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Cc: Jason Baron <jbaron@akamai.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Jiri Kosina <jkosina@suse.cz>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: David Laight <David.Laight@ACULAB.COM>
> > Cc: Jessica Yu <jeyu@kernel.org>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/00b08f2194e80241decbf206624b6580b9b8855b.1543200841.git.jpoimboe@redhat.com
> > ---
> > arch/x86/Kconfig                   |    1 
> > arch/x86/include/asm/static_call.h |   28 +++++++++++++++++++++++++++
> > arch/x86/kernel/Makefile           |    1 
> > arch/x86/kernel/static_call.c      |   38 +++++++++++++++++++++++++++++++++++++
> > 4 files changed, 68 insertions(+)
> > create mode 100644 arch/x86/include/asm/static_call.h
> > create mode 100644 arch/x86/kernel/static_call.c
> > 
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -198,6 +198,7 @@ config X86
> > 	select HAVE_FUNCTION_ARG_ACCESS_API
> > 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
> > 	select HAVE_STACK_VALIDATION		if X86_64
> > +	select HAVE_STATIC_CALL
> > 	select HAVE_RSEQ
> > 	select HAVE_SYSCALL_TRACEPOINTS
> > 	select HAVE_UNSTABLE_SCHED_CLOCK
> > --- /dev/null
> > +++ b/arch/x86/include/asm/static_call.h
> > @@ -0,0 +1,28 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_STATIC_CALL_H
> > +#define _ASM_STATIC_CALL_H
> > +
> > +/*
> > + * Manually construct a 5-byte direct JMP to prevent the assembler from
> > + * optimizing it into a 2-byte JMP.
> > + */
> > +#define __ARCH_STATIC_CALL_JMP_LABEL(key) ".L" __stringify(key ## _after_jmp)
> > +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> > +	".byte 0xe9						\n"	\
> > +	".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n"	\
> > +	__ARCH_STATIC_CALL_JMP_LABEL(key) ":"
> > +
> > +/*
> > + * This is a permanent trampoline which does a direct jump to the function.
> > + * The direct jump get patched by static_call_update().
> > + */
> > +#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)			\
> > +	asm(".pushsection .text, \"ax\"				\n"	\
> > +	    ".align 4						\n"	\
> > +	    ".globl " STATIC_CALL_TRAMP_STR(key) "		\n"	\
> > +	    ".type " STATIC_CALL_TRAMP_STR(key) ", @function	\n"	\
> > +	    STATIC_CALL_TRAMP_STR(key) ":			\n"	\
> > +	    __ARCH_STATIC_CALL_TRAMP_JMP(key, func) "           \n"	\
> > +	    ".popsection					\n")
> > +
> > +#endif /* _ASM_STATIC_CALL_H */
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -63,6 +63,7 @@ obj-y			+= tsc.o tsc_msr.o io_delay.o rt
> > obj-y			+= pci-iommu_table.o
> > obj-y			+= resource.o
> > obj-y			+= irqflags.o
> > +obj-y			+= static_call.o
> > 
> > obj-y				+= process.o
> > obj-y				+= fpu/
> > --- /dev/null
> > +++ b/arch/x86/kernel/static_call.c
> > @@ -0,0 +1,38 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/static_call.h>
> > +#include <linux/memory.h>
> > +#include <linux/bug.h>
> > +#include <asm/text-patching.h>
> > +#include <asm/nospec-branch.h>
> > +
> > +#define CALL_INSN_SIZE 5
> > +
> > +void arch_static_call_transform(void *site, void *tramp, void *func)
> > +{
> > +	unsigned char opcodes[CALL_INSN_SIZE];
> > +	unsigned char insn_opcode;
> > +	unsigned long insn;
> > +	s32 dest_relative;
> > +
> > +	mutex_lock(&text_mutex);
> > +
> > +	insn = (unsigned long)tramp;
> > +
> > +	insn_opcode = *(unsigned char *)insn;
> > +	if (insn_opcode != 0xE9) {
> > +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> > +			  insn_opcode, (void *)insn);
> > +		goto unlock;  
> 
> This might happen if a kprobe is installed on the call, no?
> 
> I don’t know if you want to handle this case more gently (or perhaps
> modify can_probe() to prevent such a case).
> 

Perhaps it is better to block kprobes from attaching to a static call.
Or have it use the static call directly as it does with ftrace. But
that would probably be much more work.

-- Steve

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07  5:41   ` Nadav Amit
@ 2019-06-07  8:20     ` Peter Zijlstra
  2019-06-07 14:27       ` Masami Hiramatsu
  0 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07  8:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 05:41:42AM +0000, Nadav Amit wrote:

> > int poke_int3_handler(struct pt_regs *regs)
> > {
> > +	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> > +	struct opcode {
> > +		u8 insn;
> > +		s32 rel;
> > +	} __packed opcode;
> > +
> > 	/*
> > 	 * Having observed our INT3 instruction, we now must observe
> > 	 * bp_patching_in_progress.
> > 	 *
> > -	 * 	in_progress = TRUE		INT3
> > -	 * 	WMB				RMB
> > -	 * 	write INT3			if (in_progress)
> > +	 *	in_progress = TRUE		INT3
> > +	 *	WMB				RMB
> > +	 *	write INT3			if (in_progress)
> 
> I don’t see what has changed in this chunk… Whitespaces?

Yep, my editor kept marking that stuff red (space before tab), which
annoyed me enough to fix it.

> > 	 *
> > -	 * Idem for bp_int3_handler.
> > +	 * Idem for bp_int3_opcode.
> > 	 */
> > 	smp_rmb();
> > 
> > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > 		return 0;
> > 
> > -	/* set up the specified breakpoint handler */
> > -	regs->ip = (unsigned long) bp_int3_handler;
> > +	opcode = *(struct opcode *)bp_int3_opcode;
> > +
> > +	switch (opcode.insn) {
> > +	case 0xE8: /* CALL */
> > +		int3_emulate_call(regs, ip + opcode.rel);
> > +		break;
> > +
> > +	case 0xE9: /* JMP */
> > +		int3_emulate_jmp(regs, ip + opcode.rel);
> > +		break;
> 
> Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
> constants (0xE8, 0xE9), just as you do later in the patch.

Those are private to kprobes..

but I can do something like so:

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -48,8 +48,14 @@ static inline void int3_emulate_jmp(stru
 	regs->ip = ip;
 }
 
-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE		1
+#define INT3_INSN_OPCODE	0xCC
+
+#define CALL_INSN_SIZE		5
+#define CALL_INSN_OPCODE	0xE8
+
+#define JMP_INSN_SIZE		5
+#define JMP_INSN_OPCODE		0xE9
 
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -952,11 +952,11 @@ int poke_int3_handler(struct pt_regs *re
 	opcode = *(struct opcode *)bp_int3_opcode;
 
 	switch (opcode.insn) {
-	case 0xE8: /* CALL */
+	case CALL_INSN_OPCODE:
 		int3_emulate_call(regs, ip + opcode.rel);
 		break;
 
-	case 0xE9: /* JMP */
+	case JMP_INSN_OPCODE:
 		int3_emulate_jmp(regs, ip + opcode.rel);
 		break;
 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-06 22:44   ` Nadav Amit
@ 2019-06-07  8:28     ` Peter Zijlstra
  2019-06-07  8:49       ` Ard Biesheuvel
  2019-10-02 13:54       ` Peter Zijlstra
  0 siblings, 2 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07  8:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
> > + * Usage example:
> > + *
> > + *   # Start with the following functions (with identical prototypes):
> > + *   int func_a(int arg1, int arg2);
> > + *   int func_b(int arg1, int arg2);
> > + *
> > + *   # Define a 'my_key' reference, associated with func_a() by default
> > + *   DEFINE_STATIC_CALL(my_key, func_a);
> > + *
> > + *   # Call func_a()
> > + *   static_call(my_key, arg1, arg2);
> > + *
> > + *   # Update 'my_key' to point to func_b()
> > + *   static_call_update(my_key, func_b);
> > + *
> > + *   # Call func_b()
> > + *   static_call(my_key, arg1, arg2);
> 
> I think that this calling interface is not very intuitive.

Yeah, it is somewhat unfortunate..

> I understand that
> the macros/objtool cannot allow the calling interface to be completely
> transparent (as a compiler plugin could). But can the macros be used to
> paste the key with the “static_call”? I think that having something like:
> 
>   static_call__func(arg1, arg2)
> 
> is more readable than
> 
>   static_call(func, arg1, arg2)

Doesn't really make it much better for me; I think I'd prefer to switch
to the GCC plugin scheme over this.  ISTR there being some prototypes
there, but I couldn't quickly locate them.

> > +}
> > +
> > +#define static_call_update(key, func)					\
> > +({									\
> > +	BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));	\
> > +	__static_call_update(&key, func);				\
> > +})
> 
> Is this safe against concurrent module removal?

It is for CONFIG_MODULE=n :-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-06 22:24   ` Nadav Amit
@ 2019-06-07  8:37     ` Peter Zijlstra
  2019-06-07 16:35       ` Nadav Amit
  2019-06-10 17:19       ` Josh Poimboeuf
  0 siblings, 2 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07  8:37 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Thu, Jun 06, 2019 at 10:24:17PM +0000, Nadav Amit wrote:

> > +static void static_call_del_module(struct module *mod)
> > +{
> > +	struct static_call_site *start = mod->static_call_sites;
> > +	struct static_call_site *stop = mod->static_call_sites +
> > +					mod->num_static_call_sites;
> > +	struct static_call_site *site;
> > +	struct static_call_key *key, *prev_key = NULL;
> > +	struct static_call_mod *site_mod;
> > +
> > +	for (site = start; site < stop; site++) {
> > +		key = static_call_key(site);
> > +		if (key == prev_key)
> > +			continue;
> > +		prev_key = key;
> > +
> > +		list_for_each_entry(site_mod, &key->site_mods, list) {
> > +			if (site_mod->mod == mod) {
> > +				list_del(&site_mod->list);
> > +				kfree(site_mod);
> > +				break;
> > +			}
> > +		}
> > +	}
> 
> I think that for safety, when a module is removed, all the static-calls
> should be traversed to check that none of them calls any function in the
> removed module. If one does, perhaps that call site should be poisoned.

We don't do that for normal indirect calls either.. I suppose we could
here, but meh.

> > +}
> > +
> > +static int static_call_module_notify(struct notifier_block *nb,
> > +				     unsigned long val, void *data)
> > +{
> > +	struct module *mod = data;
> > +	int ret = 0;
> > +
> > +	cpus_read_lock();
> > +	static_call_lock();
> > +
> > +	switch (val) {
> > +	case MODULE_STATE_COMING:
> > +		module_disable_ro(mod);
> > +		ret = static_call_add_module(mod);
> > +		module_enable_ro(mod, false);
> 
> Doesn’t it cause some pages to be W+X? Can it be avoided?

I don't know why it does this, jump_labels doesn't seem to need this,
and I'm not seeing what static_call needs differently.

> > +		if (ret) {
> > +			WARN(1, "Failed to allocate memory for static calls");
> > +			static_call_del_module(mod);
> 
> If static_call_add_module() succeeded in changing some of the calls, but not
> all, I don’t think that static_call_del_module() will correctly undo
> static_call_add_module(). The code transformations, I think, will remain.

Hurm, jump_labels has the same problem.

I wonder why kernel/module.c:prepare_coming_module() doesn't propagate
the error from the notifier call. If it were to do that, I think we'd
abort the module load and any modifications would get lost anyway.
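
Something like this, I suppose (untested sketch of kernel/module.c; the
notifier would then also have to return notifier_from_errno(ret) instead
of a bare error for this to have any effect):

static int prepare_coming_module(struct module *mod)
{
	int err;

	ftrace_module_enable(mod);
	err = klp_module_coming(mod);
	if (err)
		return err;

	/* propagate a COMING notifier failure instead of ignoring it */
	err = blocking_notifier_call_chain(&module_notify_list,
					   MODULE_STATE_COMING, mod);
	return notifier_to_errno(err);
}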

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 12/15] x86/static_call: Add out-of-line static call implementation
  2019-06-07  6:13   ` Nadav Amit
  2019-06-07  7:51     ` Steven Rostedt
@ 2019-06-07  8:38     ` Peter Zijlstra
  2019-06-07  8:52       ` Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07  8:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, Jun 07, 2019 at 06:13:58AM +0000, Nadav Amit wrote:
> > On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > +void arch_static_call_transform(void *site, void *tramp, void *func)
> > +{
> > +	unsigned char opcodes[CALL_INSN_SIZE];
> > +	unsigned char insn_opcode;
> > +	unsigned long insn;
> > +	s32 dest_relative;
> > +
> > +	mutex_lock(&text_mutex);
> > +
> > +	insn = (unsigned long)tramp;
> > +
> > +	insn_opcode = *(unsigned char *)insn;
> > +	if (insn_opcode != 0xE9) {
> > +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> > +			  insn_opcode, (void *)insn);
> > +		goto unlock;
> 
> This might happen if a kprobe is installed on the call, no?
> 
> I don’t know if you want to handle this case more gently (or perhaps
> modify can_probe() to prevent such a case).
> 

yuck.. yes, that's something that needs consideration.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-07  8:28     ` Peter Zijlstra
@ 2019-06-07  8:49       ` Ard Biesheuvel
  2019-06-07 16:33         ` Andy Lutomirski
  2019-06-07 16:58         ` Nadav Amit
  2019-10-02 13:54       ` Peter Zijlstra
  1 sibling, 2 replies; 87+ messages in thread
From: Ard Biesheuvel @ 2019-06-07  8:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, the arch/x86 maintainers, LKML, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, 7 Jun 2019 at 10:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
> > > + * Usage example:
> > > + *
> > > + *   # Start with the following functions (with identical prototypes):
> > > + *   int func_a(int arg1, int arg2);
> > > + *   int func_b(int arg1, int arg2);
> > > + *
> > > + *   # Define a 'my_key' reference, associated with func_a() by default
> > > + *   DEFINE_STATIC_CALL(my_key, func_a);
> > > + *
> > > + *   # Call func_a()
> > > + *   static_call(my_key, arg1, arg2);
> > > + *
> > > + *   # Update 'my_key' to point to func_b()
> > > + *   static_call_update(my_key, func_b);
> > > + *
> > > + *   # Call func_b()
> > > + *   static_call(my_key, arg1, arg2);
> >
> > I think that this calling interface is not very intuitive.
>
> Yeah, it is somewhat unfortunate..
>

Another thing I brought up at the time is that it would be useful to
have the ability to 'reset' a static call to its default target. E.g.,
for crypto modules that implement an accelerated version of a library
interface, removing the module should revert those call sites back to
the original target, without putting a disproportionate burden on the
module itself to implement the logic to support this.
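
As a strawman (names invented, and assuming __static_call_update() stays
available), the default could simply be recorded at definition time:

/* Record the default target next to the key... */
#define DEFINE_STATIC_CALL_DEFAULT(key, func)			\
	DEFINE_STATIC_CALL(key, func);				\
	typeof(func) *const key##_default = func

/* ...so unloading the accelerated module can just do: */
#define static_call_reset(key)					\
	__static_call_update(&key, key##_default)

That would keep all the bookkeeping out of the module itself.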


> > I understand that
> > the macros/objtool cannot allow the calling interface to be completely
> > transparent (as a compiler plugin could). But can the macros be used to
> > paste the key with the “static_call”? I think that having something like:
> >
> >   static_call__func(arg1, arg2)
> >
> > is more readable than
> >
> >   static_call(func, arg1, arg2)
>
> Doesn't really make it much better for me; I think I'd prefer to switch
> to the GCC plugin scheme over this.  ISTR there being some prototypes
> there, but I couldn't quickly locate them.
>

I implemented the GCC plugin here

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls

but IIRC, all it does is annotate call sites exactly how objtool does it.

> > > +}
> > > +
> > > +#define static_call_update(key, func)                                      \
> > > +({                                                                 \
> > > +   BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));       \
> > > +   __static_call_update(&key, func);                               \
> > > +})
> >
> > Is this safe against concurrent module removal?
>
> It is for CONFIG_MODULE=n :-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 12/15] x86/static_call: Add out-of-line static call implementation
  2019-06-07  8:38     ` Peter Zijlstra
@ 2019-06-07  8:52       ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07  8:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, Jun 07, 2019 at 10:38:46AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 06:13:58AM +0000, Nadav Amit wrote:
> > > On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > +void arch_static_call_transform(void *site, void *tramp, void *func)
> > > +{
> > > +	unsigned char opcodes[CALL_INSN_SIZE];
> > > +	unsigned char insn_opcode;
> > > +	unsigned long insn;
> > > +	s32 dest_relative;
> > > +
> > > +	mutex_lock(&text_mutex);
> > > +
> > > +	insn = (unsigned long)tramp;
> > > +
> > > +	insn_opcode = *(unsigned char *)insn;
> > > +	if (insn_opcode != 0xE9) {
> > > +		WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> > > +			  insn_opcode, (void *)insn);
> > > +		goto unlock;
> > 
> > This might happen if a kprobe is installed on the call, no?
> > 
> > I don’t know if you want to handle this case more gently (or perhaps
> > modify can_probe() to prevent such a case).
> > 
> 
> yuck.. yes, that's something that needs consideration.

For jump_label this is avoided by jump_label_text_reserved(); I'm
thinking static_call should do the same.
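
Something like the below, mirroring jump_label_text_reserved() (completely
untested; __start/__stop_static_call_sites and static_call_addr() are what
I'd expect the inline patches to end up providing, and modules would need
the same walk over mod->static_call_sites):

int static_call_text_reserved(void *start, void *end)
{
	struct static_call_site *site;

	for (site = __start_static_call_sites;
	     site < __stop_static_call_sites; site++) {
		unsigned long addr = (unsigned long)static_call_addr(site);

		/* does [start, end) overlap the CALL at this site? */
		if (addr + CALL_INSN_SIZE > (unsigned long)start &&
		    addr < (unsigned long)end)
			return 1;
	}

	return 0;
}

Then the kprobe address checks can refuse such addresses the same way they
do for jump_label.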

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/15] x86/kprobes: Fix frame pointer annotations
  2019-06-05 13:07 ` [PATCH 03/15] x86/kprobes: Fix frame pointer annotations Peter Zijlstra
@ 2019-06-07 13:02   ` Masami Hiramatsu
  2019-06-07 13:36     ` Josh Poimboeuf
  0 siblings, 1 reply; 87+ messages in thread
From: Masami Hiramatsu @ 2019-06-07 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf

On Wed, 05 Jun 2019 15:07:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> The kprobe trampolines have a FRAME_POINTER annotation that makes no
> sense. It marks the frame in the middle of pt_regs, at the place of
> saving BP.

Commit ee213fc72fd67 introduced this code, and it is there for the unwinder
which uses frame pointers. I think the current code stores the address of the
previous (original context's) frame pointer into %rbp. So with that, if the
unwinder tries to decode the frame pointer, it can get the original %rbp
value, instead of &pt_regs from the current %rbp.
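
For reference, this is all a frame pointer unwinder does with that slot
(simplified sketch only; struct stack_frame is the layout from
asm/stacktrace.h):

struct stack_frame {
	struct stack_frame *next_frame;		/* the saved %rbp */
	unsigned long return_address;
};

static void walk_frames(struct stack_frame *frame)
{
	/* each step trusts that the saved %rbp points at the caller's frame */
	while (frame) {
		pr_info("  %pS\n", (void *)frame->return_address);
		frame = frame->next_frame;
	}
}

So the value stored in the saved %rbp slot decides what the unwinder sees
next.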

> 
> Change it to mark the pt_regs frame as per the ENCODE_FRAME_POINTER
> from the respective entry_*.S.
> 

With this change, I think the stack unwinder cannot get the original %rbp
value. Peter, could you check the above commit?

Thank you,

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/kprobes/common.h |   24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> --- a/arch/x86/kernel/kprobes/common.h
> +++ b/arch/x86/kernel/kprobes/common.h
> @@ -5,15 +5,10 @@
>  /* Kprobes and Optprobes common header */
>  
>  #include <asm/asm.h>
> -
> -#ifdef CONFIG_FRAME_POINTER
> -# define SAVE_RBP_STRING "	push %" _ASM_BP "\n" \
> -			 "	mov  %" _ASM_SP ", %" _ASM_BP "\n"
> -#else
> -# define SAVE_RBP_STRING "	push %" _ASM_BP "\n"
> -#endif
> +#include <asm/frame.h>
>  
>  #ifdef CONFIG_X86_64
> +
>  #define SAVE_REGS_STRING			\
>  	/* Skip cs, ip, orig_ax. */		\
>  	"	subq $24, %rsp\n"		\
> @@ -27,11 +22,13 @@
>  	"	pushq %r10\n"			\
>  	"	pushq %r11\n"			\
>  	"	pushq %rbx\n"			\
> -	SAVE_RBP_STRING				\
> +	"	pushq %rbp\n"			\
>  	"	pushq %r12\n"			\
>  	"	pushq %r13\n"			\
>  	"	pushq %r14\n"			\
> -	"	pushq %r15\n"
> +	"	pushq %r15\n"			\
> +	ENCODE_FRAME_POINTER
> +
>  #define RESTORE_REGS_STRING			\
>  	"	popq %r15\n"			\
>  	"	popq %r14\n"			\
> @@ -51,19 +48,22 @@
>  	/* Skip orig_ax, ip, cs */		\
>  	"	addq $24, %rsp\n"
>  #else
> +
>  #define SAVE_REGS_STRING			\
>  	/* Skip cs, ip, orig_ax and gs. */	\
> -	"	subl $16, %esp\n"		\
> +	"	subl $4*4, %esp\n"		\
>  	"	pushl %fs\n"			\
>  	"	pushl %es\n"			\
>  	"	pushl %ds\n"			\
>  	"	pushl %eax\n"			\
> -	SAVE_RBP_STRING				\
> +	"	pushl %ebp\n"			\
>  	"	pushl %edi\n"			\
>  	"	pushl %esi\n"			\
>  	"	pushl %edx\n"			\
>  	"	pushl %ecx\n"			\
> -	"	pushl %ebx\n"
> +	"	pushl %ebx\n"			\
> +	ENCODE_FRAME_POINTER
> +
>  #define RESTORE_REGS_STRING			\
>  	"	popl %ebx\n"			\
>  	"	popl %ecx\n"			\
> 
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 05/15] x86_32: Provide consistent pt_regs
  2019-06-05 13:07 ` [PATCH 05/15] x86_32: Provide consistent pt_regs Peter Zijlstra
@ 2019-06-07 13:13   ` Masami Hiramatsu
  2019-06-07 19:32   ` Josh Poimboeuf
  1 sibling, 0 replies; 87+ messages in thread
From: Masami Hiramatsu @ 2019-06-07 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, 05 Jun 2019 15:07:58 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Currently pt_regs on x86_32 has an oddity in that kernel regs
> (!user_mode(regs)) are short two entries (esp/ss). This means that any
> code trying to use them (typically: regs->sp) needs to jump through
> some unfortunate hoops.
> 
> Change the entry code to fix this up and create a full pt_regs frame.
> 
> This then simplifies various trampolines in ftrace and kprobes, the
> stack unwinder, ptrace, kdump and kgdb.

The kprobes parts look good to me.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you!

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/entry/entry_32.S         |  105 ++++++++++++++++++++++++++++++++++----
>  arch/x86/include/asm/kexec.h      |   17 ------
>  arch/x86/include/asm/ptrace.h     |   17 ------
>  arch/x86/include/asm/stacktrace.h |    2 
>  arch/x86/kernel/crash.c           |    8 --
>  arch/x86/kernel/ftrace_32.S       |   77 +++++++++++++++------------
>  arch/x86/kernel/kgdb.c            |    8 --
>  arch/x86/kernel/kprobes/common.h  |    4 -
>  arch/x86/kernel/kprobes/core.c    |   29 ++++------
>  arch/x86/kernel/kprobes/opt.c     |   20 ++++---
>  arch/x86/kernel/process_32.c      |   16 +----
>  arch/x86/kernel/ptrace.c          |   29 ----------
>  arch/x86/kernel/time.c            |    3 -
>  arch/x86/kernel/unwind_frame.c    |   32 +----------
>  arch/x86/kernel/unwind_orc.c      |    2 
>  15 files changed, 178 insertions(+), 191 deletions(-)
> 
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -202,9 +202,102 @@
>  .Lend_\@:
>  .endm
>  
> +#define CS_FROM_ENTRY_STACK	(1 << 31)
> +#define CS_FROM_USER_CR3	(1 << 30)
> +#define CS_FROM_KERNEL		(1 << 29)
> +
> +.macro FIXUP_FRAME
> +	/*
> +	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
> +	 * Clear them in case hardware didn't do this for us.
> +	 */
> +	andl	$0x0000ffff, 3*4(%esp)
> +
> +#ifdef CONFIG_VM86
> +	testl	$X86_EFLAGS_VM, 4*4(%esp)
> +	jnz	.Lfrom_usermode_no_fixup_\@
> +#endif
> +	testl	$SEGMENT_RPL_MASK, 3*4(%esp)
> +	jnz	.Lfrom_usermode_no_fixup_\@
> +
> +	orl	$CS_FROM_KERNEL, 3*4(%esp)
> +
> +	/*
> +	 * When we're here from kernel mode; the (exception) stack looks like:
> +	 *
> +	 *  5*4(%esp) - <previous context>
> +	 *  4*4(%esp) - flags
> +	 *  3*4(%esp) - cs
> +	 *  2*4(%esp) - ip
> +	 *  1*4(%esp) - orig_eax
> +	 *  0*4(%esp) - gs / function
> +	 *
> +	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
> +	 * is complete and in particular regs->sp is correct. This gives us
> +	 * the original 5 entries as gap:
> +	 *
> +	 * 12*4(%esp) - <previous context>
> +	 * 11*4(%esp) - gap / flags
> +	 * 10*4(%esp) - gap / cs
> +	 *  9*4(%esp) - gap / ip
> +	 *  8*4(%esp) - gap / orig_eax
> +	 *  7*4(%esp) - gap / gs / function
> +	 *  6*4(%esp) - ss
> +	 *  5*4(%esp) - sp
> +	 *  4*4(%esp) - flags
> +	 *  3*4(%esp) - cs
> +	 *  2*4(%esp) - ip
> +	 *  1*4(%esp) - orig_eax
> +	 *  0*4(%esp) - gs / function
> +	 */
> +
> +	pushl	%ss		# ss
> +	pushl	%esp		# sp (points at ss)
> +	addl	$6*4, (%esp)	# point sp back at the previous context
> +	pushl	6*4(%esp)	# flags
> +	pushl	6*4(%esp)	# cs
> +	pushl	6*4(%esp)	# ip
> +	pushl	6*4(%esp)	# orig_eax
> +	pushl	6*4(%esp)	# gs / function
> +.Lfrom_usermode_no_fixup_\@:
> +.endm
> +
> +.macro IRET_FRAME
> +	testl $CS_FROM_KERNEL, 1*4(%esp)
> +	jz .Lfinished_frame_\@
> +
> +	/*
> +	 * Reconstruct the 3 entry IRET frame right after the (modified)
> +	 * regs->sp without lowering %esp in between, such that an NMI in the
> +	 * middle doesn't scribble our stack.
> +	 */
> +	pushl	%eax
> +	pushl	%ecx
> +	movl	5*4(%esp), %eax		# (modified) regs->sp
> +
> +	movl	4*4(%esp), %ecx		# flags
> +	movl	%ecx, -4(%eax)
> +
> +	movl	3*4(%esp), %ecx		# cs
> +	andl	$0x0000ffff, %ecx
> +	movl	%ecx, -8(%eax)
> +
> +	movl	2*4(%esp), %ecx		# ip
> +	movl	%ecx, -12(%eax)
> +
> +	movl	1*4(%esp), %ecx		# eax
> +	movl	%ecx, -16(%eax)
> +
> +	popl	%ecx
> +	lea	-16(%eax), %esp
> +	popl	%eax
> +.Lfinished_frame_\@:
> +.endm
> +
>  .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
>  	cld
>  	PUSH_GS
> +	FIXUP_FRAME
>  	pushl	%fs
>  	pushl	%es
>  	pushl	%ds
> @@ -358,9 +451,6 @@
>   * switch to it before we do any copying.
>   */
>  
> -#define CS_FROM_ENTRY_STACK	(1 << 31)
> -#define CS_FROM_USER_CR3	(1 << 30)
> -
>  .macro SWITCH_TO_KERNEL_STACK
>  
>  	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
> @@ -374,13 +464,6 @@
>  	 * that register for the time this macro runs
>  	 */
>  
> -	/*
> -	 * The high bits of the CS dword (__csh) are used for
> -	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
> -	 * hardware didn't do this for us.
> -	 */
> -	andl	$(0x0000ffff), PT_CS(%esp)
> -
>  	/* Are we on the entry stack? Bail out if not! */
>  	movl	PER_CPU_VAR(cpu_entry_area), %ecx
>  	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
> @@ -990,6 +1073,7 @@ ENTRY(entry_INT80_32)
>  	/* Restore user state */
>  	RESTORE_REGS pop=4			# skip orig_eax/error_code
>  .Lirq_return:
> +	IRET_FRAME
>  	/*
>  	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
>  	 * when returning from IPI handler and when returning from
> @@ -1340,6 +1424,7 @@ END(page_fault)
>  
>  common_exception:
>  	/* the function address is in %gs's slot on the stack */
> +	FIXUP_FRAME
>  	pushl	%fs
>  	pushl	%es
>  	pushl	%ds
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -71,22 +71,6 @@ struct kimage;
>  #define KEXEC_BACKUP_SRC_END	(640 * 1024UL - 1)	/* 640K */
>  
>  /*
> - * CPU does not save ss and sp on stack if execution is already
> - * running in kernel mode at the time of NMI occurrence. This code
> - * fixes it.
> - */
> -static inline void crash_fixup_ss_esp(struct pt_regs *newregs,
> -				      struct pt_regs *oldregs)
> -{
> -#ifdef CONFIG_X86_32
> -	newregs->sp = (unsigned long)&(oldregs->sp);
> -	asm volatile("xorl %%eax, %%eax\n\t"
> -		     "movw %%ss, %%ax\n\t"
> -		     :"=a"(newregs->ss));
> -#endif
> -}
> -
> -/*
>   * This function is responsible for capturing register states if coming
>   * via panic otherwise just fix up the ss and sp if coming via kernel
>   * mode exception.
> @@ -96,7 +80,6 @@ static inline void crash_setup_regs(stru
>  {
>  	if (oldregs) {
>  		memcpy(newregs, oldregs, sizeof(*newregs));
> -		crash_fixup_ss_esp(newregs, oldregs);
>  	} else {
>  #ifdef CONFIG_X86_32
>  		asm volatile("movl %%ebx,%0" : "=m"(newregs->bx));
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struc
>  #define compat_user_stack_pointer()	current_pt_regs()->sp
>  #endif
>  
> -#ifdef CONFIG_X86_32
> -extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
> -#else
>  static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
>  {
>  	return regs->sp;
>  }
> -#endif
>  
>  #define GET_IP(regs) ((regs)->ip)
>  #define GET_FP(regs) ((regs)->bp)
> @@ -201,14 +197,6 @@ static inline unsigned long regs_get_reg
>  	if (unlikely(offset > MAX_REG_OFFSET))
>  		return 0;
>  #ifdef CONFIG_X86_32
> -	/*
> -	 * Traps from the kernel do not save sp and ss.
> -	 * Use the helper function to retrieve sp.
> -	 */
> -	if (offset == offsetof(struct pt_regs, sp) &&
> -	    regs->cs == __KERNEL_CS)
> -		return kernel_stack_pointer(regs);
> -
>  	/* The selector fields are 16-bit. */
>  	if (offset == offsetof(struct pt_regs, cs) ||
>  	    offset == offsetof(struct pt_regs, ss) ||
> @@ -234,8 +222,7 @@ static inline unsigned long regs_get_reg
>  static inline int regs_within_kernel_stack(struct pt_regs *regs,
>  					   unsigned long addr)
>  {
> -	return ((addr & ~(THREAD_SIZE - 1))  ==
> -		(kernel_stack_pointer(regs) & ~(THREAD_SIZE - 1)));
> +	return ((addr & ~(THREAD_SIZE - 1)) == (regs->sp & ~(THREAD_SIZE - 1)));
>  }
>  
>  /**
> @@ -249,7 +236,7 @@ static inline int regs_within_kernel_sta
>   */
>  static inline unsigned long *regs_get_kernel_stack_nth_addr(struct pt_regs *regs, unsigned int n)
>  {
> -	unsigned long *addr = (unsigned long *)kernel_stack_pointer(regs);
> +	unsigned long *addr = (unsigned long *)regs->sp;
>  
>  	addr += n;
>  	if (regs_within_kernel_stack(regs, (unsigned long)addr))
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -78,7 +78,7 @@ static inline unsigned long *
>  get_stack_pointer(struct task_struct *task, struct pt_regs *regs)
>  {
>  	if (regs)
> -		return (unsigned long *)kernel_stack_pointer(regs);
> +		return (unsigned long *)regs->sp;
>  
>  	if (task == current)
>  		return __builtin_frame_address(0);
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -72,14 +72,6 @@ static inline void cpu_crash_vmclear_loa
>  
>  static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
>  {
> -#ifdef CONFIG_X86_32
> -	struct pt_regs fixed_regs;
> -
> -	if (!user_mode(regs)) {
> -		crash_fixup_ss_esp(&fixed_regs, regs);
> -		regs = &fixed_regs;
> -	}
> -#endif
>  	crash_save_cpu(regs, cpu);
>  
>  	/*
> --- a/arch/x86/kernel/ftrace_32.S
> +++ b/arch/x86/kernel/ftrace_32.S
> @@ -10,6 +10,7 @@
>  #include <asm/ftrace.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/frame.h>
> +#include <asm/asm-offsets.h>
>  
>  # define function_hook	__fentry__
>  EXPORT_SYMBOL(__fentry__)
> @@ -90,26 +91,38 @@ END(ftrace_caller)
>  
>  ENTRY(ftrace_regs_caller)
>  	/*
> -	 * i386 does not save SS and ESP when coming from kernel.
> -	 * Instead, to get sp, &regs->sp is used (see ptrace.h).
> -	 * Unfortunately, that means eflags must be at the same location
> -	 * as the current return ip is. We move the return ip into the
> -	 * regs->ip location, and move flags into the return ip location.
> +	 * We're here from an mcount/fentry CALL, and the stack frame looks like:
> +	 *
> +	 *  <previous context>
> +	 *  RET-IP
> +	 *
> +	 * The purpose of this function is to call out in an emulated INT3
> +	 * environment with a stack frame like:
> +	 *
> +	 *  <previous context>
> +	 *  gap / RET-IP
> +	 *  gap
> +	 *  gap
> +	 *  gap
> +	 *  pt_regs
> +	 *
> +	 * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds
>  	 */
> -	pushl	$__KERNEL_CS
> -	pushl	4(%esp)				/* Save the return ip */
> -	pushl	$0				/* Load 0 into orig_ax */
> +	subl	$3*4, %esp	# RET-IP + 3 gaps
> +	pushl	%ss		# ss
> +	pushl	%esp		# points at ss
> +	addl	$5*4, (%esp)	#   make it point at <previous context>
> +	pushfl			# flags
> +	pushl	$__KERNEL_CS	# cs
> +	pushl	7*4(%esp)	# ip <- RET-IP
> +	pushl	$0		# orig_eax
> +
>  	pushl	%gs
>  	pushl	%fs
>  	pushl	%es
>  	pushl	%ds
> -	pushl	%eax
> -
> -	/* Get flags and place them into the return ip slot */
> -	pushf
> -	popl	%eax
> -	movl	%eax, 8*4(%esp)
>  
> +	pushl	%eax
>  	pushl	%ebp
>  	pushl	%edi
>  	pushl	%esi
> @@ -119,24 +132,25 @@ ENTRY(ftrace_regs_caller)
>  
>  	ENCODE_FRAME_POINTER
>  
> -	movl	12*4(%esp), %eax		/* Load ip (1st parameter) */
> -	subl	$MCOUNT_INSN_SIZE, %eax		/* Adjust ip */
> -	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
> -	movl	function_trace_op, %ecx		/* Save ftrace_pos in 3rd parameter */
> -	pushl	%esp				/* Save pt_regs as 4th parameter */
> +	movl	PT_EIP(%esp), %eax	# 1st argument: IP
> +	subl	$MCOUNT_INSN_SIZE, %eax
> +	movl	21*4(%esp), %edx	# 2nd argument: parent ip
> +	movl	function_trace_op, %ecx	# 3rd argument: ftrace_pos
> +	pushl	%esp			# 4th argument: pt_regs
>  
>  GLOBAL(ftrace_regs_call)
>  	call	ftrace_stub
>  
> -	addl	$4, %esp			/* Skip pt_regs */
> +	addl	$4, %esp		# skip 4th argument
>  
> -	/* restore flags */
> -	push	14*4(%esp)
> -	popf
> -
> -	/* Move return ip back to its original location */
> -	movl	12*4(%esp), %eax
> -	movl	%eax, 14*4(%esp)
> +	/* place IP below the new SP */
> +	movl	PT_OLDESP(%esp), %eax
> +	movl	PT_EIP(%esp), %ecx
> +	movl	%ecx, -4(%eax)
> +
> +	/* place EAX below that */
> +	movl	PT_EAX(%esp), %ecx
> +	movl	%ecx, -8(%eax)
>  
>  	popl	%ebx
>  	popl	%ecx
> @@ -144,14 +158,9 @@ GLOBAL(ftrace_regs_call)
>  	popl	%esi
>  	popl	%edi
>  	popl	%ebp
> -	popl	%eax
> -	popl	%ds
> -	popl	%es
> -	popl	%fs
> -	popl	%gs
>  
> -	/* use lea to not affect flags */
> -	lea	3*4(%esp), %esp			/* Skip orig_ax, ip and cs */
> +	lea	-8(%eax), %esp
> +	popl	%eax
>  
>  	jmp	.Lftrace_ret
>  
> --- a/arch/x86/kernel/kgdb.c
> +++ b/arch/x86/kernel/kgdb.c
> @@ -127,14 +127,6 @@ char *dbg_get_reg(int regno, void *mem,
>  
>  #ifdef CONFIG_X86_32
>  	switch (regno) {
> -	case GDB_SS:
> -		if (!user_mode(regs))
> -			*(unsigned long *)mem = __KERNEL_DS;
> -		break;
> -	case GDB_SP:
> -		if (!user_mode(regs))
> -			*(unsigned long *)mem = kernel_stack_pointer(regs);
> -		break;
>  	case GDB_GS:
>  	case GDB_FS:
>  		*(unsigned long *)mem = 0xFFFF;
> --- a/arch/x86/kernel/kprobes/common.h
> +++ b/arch/x86/kernel/kprobes/common.h
> @@ -72,8 +72,8 @@
>  	"	popl %edi\n"			\
>  	"	popl %ebp\n"			\
>  	"	popl %eax\n"			\
> -	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\
> -	"	addl $24, %esp\n"
> +	/* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\
> +	"	addl $7*4, %esp\n"
>  #endif
>  
>  /* Ensure if the instruction can be boostable */
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -69,7 +69,7 @@
>  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  
> -#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
> +#define stack_addr(regs) ((unsigned long *)regs->sp)
>  
>  #define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
>  	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
> @@ -731,29 +731,27 @@ asm(
>  	".global kretprobe_trampoline\n"
>  	".type kretprobe_trampoline, @function\n"
>  	"kretprobe_trampoline:\n"
> -#ifdef CONFIG_X86_64
>  	/* We don't bother saving the ss register */
> +#ifdef CONFIG_X86_64
>  	"	pushq %rsp\n"
>  	"	pushfq\n"
>  	SAVE_REGS_STRING
>  	"	movq %rsp, %rdi\n"
>  	"	call trampoline_handler\n"
>  	/* Replace saved sp with true return address. */
> -	"	movq %rax, 152(%rsp)\n"
> +	"	movq %rax, 19*8(%rsp)\n"
>  	RESTORE_REGS_STRING
>  	"	popfq\n"
>  #else
> -	"	pushf\n"
> +	"	pushl %esp\n"
> +	"	pushfl\n"
>  	SAVE_REGS_STRING
>  	"	movl %esp, %eax\n"
>  	"	call trampoline_handler\n"
> -	/* Move flags to cs */
> -	"	movl 56(%esp), %edx\n"
> -	"	movl %edx, 52(%esp)\n"
> -	/* Replace saved flags with true return address. */
> -	"	movl %eax, 56(%esp)\n"
> +	/* Replace saved sp with true return address. */
> +	"	movl %eax, 15*4(%esp)\n"
>  	RESTORE_REGS_STRING
> -	"	popf\n"
> +	"	popfl\n"
>  #endif
>  	"	ret\n"
>  	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
> @@ -794,16 +792,13 @@ __used __visible void *trampoline_handle
>  	INIT_HLIST_HEAD(&empty_rp);
>  	kretprobe_hash_lock(current, &head, &flags);
>  	/* fixup registers */
> -#ifdef CONFIG_X86_64
>  	regs->cs = __KERNEL_CS;
> -	/* On x86-64, we use pt_regs->sp for return address holder. */
> -	frame_pointer = &regs->sp;
> -#else
> -	regs->cs = __KERNEL_CS | get_kernel_rpl();
> +#ifdef CONFIG_X86_32
> +	regs->cs |= get_kernel_rpl();
>  	regs->gs = 0;
> -	/* On x86-32, we use pt_regs->flags for return address holder. */
> -	frame_pointer = &regs->flags;
>  #endif
> +	/* We use pt_regs->sp for return address holder. */
> +	frame_pointer = &regs->sp;
>  	regs->ip = trampoline_address;
>  	regs->orig_ax = ~0UL;
>  
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -115,14 +115,15 @@ asm (
>  			"optprobe_template_call:\n"
>  			ASM_NOP5
>  			/* Move flags to rsp */
> -			"	movq 144(%rsp), %rdx\n"
> -			"	movq %rdx, 152(%rsp)\n"
> +			"	movq 18*8(%rsp), %rdx\n"
> +			"	movq %rdx, 19*8(%rsp)\n"
>  			RESTORE_REGS_STRING
>  			/* Skip flags entry */
>  			"	addq $8, %rsp\n"
>  			"	popfq\n"
>  #else /* CONFIG_X86_32 */
> -			"	pushf\n"
> +			"	pushl %esp\n"
> +			"	pushfl\n"
>  			SAVE_REGS_STRING
>  			"	movl %esp, %edx\n"
>  			".global optprobe_template_val\n"
> @@ -131,9 +132,13 @@ asm (
>  			".global optprobe_template_call\n"
>  			"optprobe_template_call:\n"
>  			ASM_NOP5
> +			/* Move flags into esp */
> +			"	movl 14*4(%esp), %edx\n"
> +			"	movl %edx, 15*4(%esp)\n"
>  			RESTORE_REGS_STRING
> -			"	addl $4, %esp\n"	/* skip cs */
> -			"	popf\n"
> +			/* Skip flags entry */
> +			"	addl $4, %esp\n"
> +			"	popfl\n"
>  #endif
>  			".global optprobe_template_end\n"
>  			"optprobe_template_end:\n"
> @@ -165,10 +170,9 @@ optimized_callback(struct optimized_kpro
>  	} else {
>  		struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
>  		/* Save skipped registers */
> -#ifdef CONFIG_X86_64
>  		regs->cs = __KERNEL_CS;
> -#else
> -		regs->cs = __KERNEL_CS | get_kernel_rpl();
> +#ifdef CONFIG_X86_32
> +		regs->cs |= get_kernel_rpl();
>  		regs->gs = 0;
>  #endif
>  		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -62,27 +62,21 @@ void __show_regs(struct pt_regs *regs, e
>  {
>  	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
>  	unsigned long d0, d1, d2, d3, d6, d7;
> -	unsigned long sp;
> -	unsigned short ss, gs;
> +	unsigned short gs;
>  
> -	if (user_mode(regs)) {
> -		sp = regs->sp;
> -		ss = regs->ss;
> +	if (user_mode(regs))
>  		gs = get_user_gs(regs);
> -	} else {
> -		sp = kernel_stack_pointer(regs);
> -		savesegment(ss, ss);
> +	else
>  		savesegment(gs, gs);
> -	}
>  
>  	show_ip(regs, KERN_DEFAULT);
>  
>  	printk(KERN_DEFAULT "EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
>  		regs->ax, regs->bx, regs->cx, regs->dx);
>  	printk(KERN_DEFAULT "ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n",
> -		regs->si, regs->di, regs->bp, sp);
> +		regs->si, regs->di, regs->bp, regs->sp);
>  	printk(KERN_DEFAULT "DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x EFLAGS: %08lx\n",
> -	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss, regs->flags);
> +	       (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, regs->ss, regs->flags);
>  
>  	if (mode != SHOW_REGS_ALL)
>  		return;
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -153,35 +153,6 @@ static inline bool invalid_selector(u16
>  
>  #define FLAG_MASK		FLAG_MASK_32
>  
> -/*
> - * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
> - * when it traps.  The previous stack will be directly underneath the saved
> - * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
> - *
> - * Now, if the stack is empty, '&regs->sp' is out of range. In this
> - * case we try to take the previous stack. To always return a non-null
> - * stack pointer we fall back to regs as stack if no previous stack
> - * exists.
> - *
> - * This is valid only for kernel mode traps.
> - */
> -unsigned long kernel_stack_pointer(struct pt_regs *regs)
> -{
> -	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
> -	unsigned long sp = (unsigned long)&regs->sp;
> -	u32 *prev_esp;
> -
> -	if (context == (sp & ~(THREAD_SIZE - 1)))
> -		return sp;
> -
> -	prev_esp = (u32 *)(context);
> -	if (*prev_esp)
> -		return (unsigned long)*prev_esp;
> -
> -	return (unsigned long)regs;
> -}
> -EXPORT_SYMBOL_GPL(kernel_stack_pointer);
> -
>  static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
>  {
>  	BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -37,8 +37,7 @@ unsigned long profile_pc(struct pt_regs
>  #ifdef CONFIG_FRAME_POINTER
>  		return *(unsigned long *)(regs->bp + sizeof(long));
>  #else
> -		unsigned long *sp =
> -			(unsigned long *)kernel_stack_pointer(regs);
> +		unsigned long *sp = (unsigned long *)regs->sp;
>  		/*
>  		 * Return address is either directly at stack pointer
>  		 * or above a saved flags. Eflags has bits 22-31 zero,
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -69,15 +69,6 @@ static void unwind_dump(struct unwind_st
>  	}
>  }
>  
> -static size_t regs_size(struct pt_regs *regs)
> -{
> -	/* x86_32 regs from kernel mode are two words shorter: */
> -	if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs))
> -		return sizeof(*regs) - 2*sizeof(long);
> -
> -	return sizeof(*regs);
> -}
> -
>  static bool in_entry_code(unsigned long ip)
>  {
>  	char *addr = (char *)ip;
> @@ -197,12 +188,6 @@ static struct pt_regs *decode_frame_poin
>  }
>  #endif
>  
> -#ifdef CONFIG_X86_32
> -#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long))
> -#else
> -#define KERNEL_REGS_SIZE (sizeof(struct pt_regs))
> -#endif
> -
>  static bool update_stack_state(struct unwind_state *state,
>  			       unsigned long *next_bp)
>  {
> @@ -213,7 +198,7 @@ static bool update_stack_state(struct un
>  	size_t len;
>  
>  	if (state->regs)
> -		prev_frame_end = (void *)state->regs + regs_size(state->regs);
> +		prev_frame_end = (void *)state->regs + sizeof(*state->regs);
>  	else
>  		prev_frame_end = (void *)state->bp + FRAME_HEADER_SIZE;
>  
> @@ -221,7 +206,7 @@ static bool update_stack_state(struct un
>  	regs = decode_frame_pointer(next_bp);
>  	if (regs) {
>  		frame = (unsigned long *)regs;
> -		len = KERNEL_REGS_SIZE;
> +		len = sizeof(*regs);
>  		state->got_irq = true;
>  	} else {
>  		frame = next_bp;
> @@ -245,14 +230,6 @@ static bool update_stack_state(struct un
>  	    frame < prev_frame_end)
>  		return false;
>  
> -	/*
> -	 * On 32-bit with user mode regs, make sure the last two regs are safe
> -	 * to access:
> -	 */
> -	if (IS_ENABLED(CONFIG_X86_32) && regs && user_mode(regs) &&
> -	    !on_stack(info, frame, len + 2*sizeof(long)))
> -		return false;
> -
>  	/* Move state to the next frame: */
>  	if (regs) {
>  		state->regs = regs;
> @@ -411,10 +388,9 @@ void __unwind_start(struct unwind_state
>  	 * Pretend that the frame is complete and that BP points to it, but save
>  	 * the real BP so that we can use it when looking for the next frame.
>  	 */
> -	if (regs && regs->ip == 0 &&
> -	    (unsigned long *)kernel_stack_pointer(regs) >= first_frame) {
> +	if (regs && regs->ip == 0 && (unsigned long *)regs->sp >= first_frame) {
>  		state->next_bp = bp;
> -		bp = ((unsigned long *)kernel_stack_pointer(regs)) - 1;
> +		bp = ((unsigned long *)regs->sp) - 1;
>  	}
>  
>  	/* Initialize stack info and make sure the frame data is accessible: */
> --- a/arch/x86/kernel/unwind_orc.c
> +++ b/arch/x86/kernel/unwind_orc.c
> @@ -579,7 +579,7 @@ void __unwind_start(struct unwind_state
>  			goto done;
>  
>  		state->ip = regs->ip;
> -		state->sp = kernel_stack_pointer(regs);
> +		state->sp = regs->sp;
>  		state->bp = regs->bp;
>  		state->regs = regs;
>  		state->full_regs = true;
> 
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/15] x86/kprobes: Fix frame pointer annotations
  2019-06-07 13:02   ` Masami Hiramatsu
@ 2019-06-07 13:36     ` Josh Poimboeuf
  2019-06-07 15:21       ` Masami Hiramatsu
  2019-06-11  8:12       ` Peter Zijlstra
  0 siblings, 2 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-07 13:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 10:02:10PM +0900, Masami Hiramatsu wrote:
> On Wed, 05 Jun 2019 15:07:56 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > The kprobe trampolines have a FRAME_POINTER annotation that makes no
> > sense. It marks the frame in the middle of pt_regs, at the place of
> > saving BP.
> 
> commit ee213fc72fd67 introduced this code, and this is for unwinder which
> uses frame pointer. I think current code stores the address of previous
> (original context's) frame pointer into %rbp. So with that, if unwinder
> tries to decode frame pointer, it can get the original %rbp value,
> instead of &pt_regs from current %rbp.
> 
> > 
> > Change it to mark the pt_regs frame as per the ENCODE_FRAME_POINTER
> > from the respective entry_*.S.
> > 
> 
> With this change, I think stack unwinder can not get the original %rbp
> value. Peter, could you check the above commit?

The unwinder knows how to decode the encoded frame pointer.  So it can
find regs by decoding the new rbp value, and it also knows that regs->bp
is the original rbp value.
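
For reference, the decode side is roughly the following (paraphrased from
memory of unwind_frame.c:decode_frame_pointer(), so details may differ):

    /*
     * x86_64 flavour: ENCODE_FRAME_POINTER sets bit 0 of the pt_regs
     * address, which a real (aligned) frame pointer can never have set.
     */
    static struct pt_regs *decode_frame_pointer(unsigned long *bp)
    {
            unsigned long regs = (unsigned long)bp;

            if (!(regs & 0x1))
                    return NULL;            /* ordinary frame pointer */

            return (struct pt_regs *)(regs & ~0x1UL);
    }

and regs->bp then still holds the caller's real %rbp.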

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path
  2019-06-05 13:07 ` [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path Peter Zijlstra
@ 2019-06-07 14:21   ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-07 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:07:54PM +0200, Peter Zijlstra wrote:
> The code flow around the return from interrupt preemption point seems
> needlesly complicated.

"needlessly"

> 
> There is only one site jumping to resume_kernel, and none (outside of
> resume_kernel) jumping to restore_all_kernel. Inline resume_kernel
> in restore_all_kernel and avoid the CONFIG_PREEMPT dependent label.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h
  2019-06-05 13:07 ` [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h Peter Zijlstra
@ 2019-06-07 14:24   ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-07 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:07:55PM +0200, Peter Zijlstra wrote:
> In preparation for wider use, move the ENCODE_FRAME_POINTER macros to
> a common header and provide inline asm versions.
> 
> These macros are used to encode a pt_regs frame for the unwinder; see
> unwind_frame.c:decode_frame_pointer().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07  8:20     ` Peter Zijlstra
@ 2019-06-07 14:27       ` Masami Hiramatsu
  0 siblings, 0 replies; 87+ messages in thread
From: Masami Hiramatsu @ 2019-06-07 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, 7 Jun 2019 10:20:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jun 07, 2019 at 05:41:42AM +0000, Nadav Amit wrote:
> 
> > > int poke_int3_handler(struct pt_regs *regs)
> > > {
> > > +	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> > > +	struct opcode {
> > > +		u8 insn;
> > > +		s32 rel;
> > > +	} __packed opcode;
> > > +
> > > 	/*
> > > 	 * Having observed our INT3 instruction, we now must observe
> > > 	 * bp_patching_in_progress.
> > > 	 *
> > > -	 * 	in_progress = TRUE		INT3
> > > -	 * 	WMB				RMB
> > > -	 * 	write INT3			if (in_progress)
> > > +	 *	in_progress = TRUE		INT3
> > > +	 *	WMB				RMB
> > > +	 *	write INT3			if (in_progress)
> > 
> > I don’t see what has changed in this chunk… Whitespaces?
> 
> Yep, my editor kept marking that stuff red (space before tab), which
> annoyed me enough to fix it.
> 
> > > 	 *
> > > -	 * Idem for bp_int3_handler.
> > > +	 * Idem for bp_int3_opcode.
> > > 	 */
> > > 	smp_rmb();
> > > 
> > > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > > 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > > 		return 0;
> > > 
> > > -	/* set up the specified breakpoint handler */
> > > -	regs->ip = (unsigned long) bp_int3_handler;
> > > +	opcode = *(struct opcode *)bp_int3_opcode;
> > > +
> > > +	switch (opcode.insn) {
> > > +	case 0xE8: /* CALL */
> > > +		int3_emulate_call(regs, ip + opcode.rel);
> > > +		break;
> > > +
> > > +	case 0xE9: /* JMP */
> > > +		int3_emulate_jmp(regs, ip + opcode.rel);
> > > +		break;
> > 
> > Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
> > constants (0xE8, 0xE9), just as you do later in the patch.
> 
> Those are private to kprobes..
> 
> but I can do something like so:
> 
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -48,8 +48,14 @@ static inline void int3_emulate_jmp(stru
>  	regs->ip = ip;
>  }
>  
> -#define INT3_INSN_SIZE 1
> -#define CALL_INSN_SIZE 5
> +#define INT3_INSN_SIZE		1
> +#define INT3_INSN_OPCODE	0xCC
> +
> +#define CALL_INSN_SIZE		5
> +#define CALL_INSN_OPCODE	0xE8
> +
> +#define JMP_INSN_SIZE		5
> +#define JMP_INSN_OPCODE		0xE9
>  
>  static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
>  {
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -952,11 +952,11 @@ int poke_int3_handler(struct pt_regs *re
>  	opcode = *(struct opcode *)bp_int3_opcode;
>  
>  	switch (opcode.insn) {
> -	case 0xE8: /* CALL */
> +	case CALL_INSN_OPCODE:
>  		int3_emulate_call(regs, ip + opcode.rel);
>  		break;
>  
> -	case 0xE9: /* JMP */
> +	case JMP_INSN_OPCODE:
>  		int3_emulate_jmp(regs, ip + opcode.rel);
>  		break;
>  

This looks good. I don't want to keep those opcodes private;
I would like to share them.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations
  2019-06-05 13:07 ` [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations Peter Zijlstra
@ 2019-06-07 14:45   ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-07 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:07:57PM +0200, Peter Zijlstra wrote:
> When CONFIG_FRAME_POINTER, we should mark pt_regs frames.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/15] x86/kprobes: Fix frame pointer annotations
  2019-06-07 13:36     ` Josh Poimboeuf
@ 2019-06-07 15:21       ` Masami Hiramatsu
  2019-06-11  8:12       ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Masami Hiramatsu @ 2019-06-07 15:21 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, 7 Jun 2019 09:36:02 -0400
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Fri, Jun 07, 2019 at 10:02:10PM +0900, Masami Hiramatsu wrote:
> > On Wed, 05 Jun 2019 15:07:56 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > The kprobe trampolines have a FRAME_POINTER annotation that makes no
> > > sense. It marks the frame in the middle of pt_regs, at the place of
> > > saving BP.
> > 
> > commit ee213fc72fd67 introduced this code, and this is for unwinder which
> > uses frame pointer. I think current code stores the address of previous
> > (original context's) frame pointer into %rbp. So with that, if unwinder
> > tries to decode frame pointer, it can get the original %rbp value,
> > instead of &pt_regs from current %rbp.
> > 
> > > 
> > > Change it to mark the pt_regs frame as per the ENCODE_FRAME_POINTER
> > > from the respective entry_*.S.
> > > 
> > 
> > With this change, I think stack unwinder can not get the original %rbp
> > value. Peter, could you check the above commit?
> 
> The unwinder knows how to decode the encoded frame pointer.  So it can
> find regs by decoding the new rbp value, and it also knows that regs->bp
> is the original rbp value.
> 
> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
> 

Ah, OK, I misunderstood. So this encodes the frame pointer the same way as
the other interrupt entry stacks do.
Then it looks good to me too.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you Josh!




> -- 
> Josh


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
  2019-06-07  5:41   ` Nadav Amit
@ 2019-06-07 15:47   ` Masami Hiramatsu
  2019-06-07 17:34     ` Peter Zijlstra
  2019-06-10 16:57   ` Josh Poimboeuf
  2019-06-11 15:14   ` Steven Rostedt
  3 siblings, 1 reply; 87+ messages in thread
From: Masami Hiramatsu @ 2019-06-07 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, 05 Jun 2019 15:08:01 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
> 
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
> 
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
> 
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.

Hm, actually it doesn't restore random text, since the first byte
must always be int3. As the function name implies, it just unoptimizes
(jump-based optprobe -> int3-based kprobe).
Anyway, that is not an issue. With this patch, optprobes must still work.

> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Nadav Amit <namit@vmware.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/text-patching.h |    2 -
>  arch/x86/kernel/alternative.c        |   47 ++++++++++++++++++++++++++---------
>  arch/x86/kernel/jump_label.c         |    3 --
>  arch/x86/kernel/kprobes/opt.c        |   11 +++++---
>  4 files changed, 46 insertions(+), 17 deletions(-)
> 
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
> -extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
>  extern int after_bootmem;
>  extern __ro_after_init struct mm_struct *poking_mm;
>  extern __ro_after_init unsigned long poking_addr;
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -921,19 +921,25 @@ static void do_sync_core(void *info)
>  }
>  
>  static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static const void *bp_int3_opcode, *bp_int3_addr;
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> +	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> +	struct opcode {
> +		u8 insn;
> +		s32 rel;
> +	} __packed opcode;
> +
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
>  	 * bp_patching_in_progress.
>  	 *
> -	 * 	in_progress = TRUE		INT3
> -	 * 	WMB				RMB
> -	 * 	write INT3			if (in_progress)
> +	 *	in_progress = TRUE		INT3
> +	 *	WMB				RMB
> +	 *	write INT3			if (in_progress)
>  	 *
> -	 * Idem for bp_int3_handler.
> +	 * Idem for bp_int3_opcode.
>  	 */
>  	smp_rmb();
>  
> @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
>  	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>  		return 0;
>  
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	opcode = *(struct opcode *)bp_int3_opcode;
> +
> +	switch (opcode.insn) {
> +	case 0xE8: /* CALL */
> +		int3_emulate_call(regs, ip + opcode.rel);
> +		break;
> +
> +	case 0xE9: /* JMP */
> +		int3_emulate_jmp(regs, ip + opcode.rel);
> +		break;
> +
> +	default: /* assume NOP */

Shouldn't we check whether it is actually a NOP here?

> +		int3_emulate_jmp(regs, ip);
> +		break;
> +	}

BTW, if we fix the patching length to always be 5 bytes and only allow
users to apply it from/to a jump/call/nop, it may be better to remove
"len" and rename the function to something like "text_poke_branch".

Thank you,

>  
>  	return 1;
>  }
> @@ -955,7 +974,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
>   * @addr:	address to patch
>   * @opcode:	opcode of new instruction
>   * @len:	length to copy
> - * @handler:	address to jump to when the temporary breakpoint is hit
> + * @emulate:	opcode to emulate, when NULL use @opcode
>   *
>   * Modify multi-byte instruction by using int3 breakpoint on SMP.
>   * We completely avoid stop_machine() here, and achieve the
> @@ -970,19 +989,25 @@ NOKPROBE_SYMBOL(poke_int3_handler);
>   *	  replacing opcode
>   *	- sync cores
>   */
> -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
>  {
>  	unsigned char int3 = 0xcc;
>  
> -	bp_int3_handler = handler;
> +	bp_int3_opcode = emulate ?: opcode;
>  	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	bp_patching_in_progress = true;
>  
>  	lockdep_assert_held(&text_mutex);
>  
>  	/*
> +	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> +	 * notably a JMP, CALL or NOP5_ATOMIC.
> +	 */
> +	BUG_ON(len != 5);
> +
> +	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> -	 * in_progress and handler are correctly ordered wrt. patching.
> +	 * in_progress and opcode are correctly ordered wrt. patching.
>  	 */
>  	smp_wmb();
>  
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -87,8 +87,7 @@ static void __ref __jump_label_transform
>  		return;
>  	}
>  
> -	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
> -		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
> +	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
>  }
>  
>  void arch_jump_label_transform(struct jump_entry *entry,
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
>  		insn_buff[0] = RELATIVEJUMP_OPCODE;
>  		*(s32 *)(&insn_buff[1]) = rel;
>  
> -		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> -			     op->optinsn.insn);
> +		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);
>  
>  		list_del_init(&op->list);
>  	}
> @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
>  void arch_unoptimize_kprobe(struct optimized_kprobe *op)
>  {
>  	u8 insn_buff[RELATIVEJUMP_SIZE];
> +	u8 emulate_buff[RELATIVEJUMP_SIZE];
>  
>  	/* Set int3 to first byte for kprobes */
>  	insn_buff[0] = BREAKPOINT_INSTRUCTION;
>  	memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
> +
> +	emulate_buff[0] = RELATIVEJUMP_OPCODE;
> +	*(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> +			((long)op->kp.addr + RELATIVEJUMP_SIZE));
> +
>  	text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> -		     op->optinsn.insn);
> +		     emulate_buff);
>  }
>  
>  /*
> 
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-07  8:49       ` Ard Biesheuvel
@ 2019-06-07 16:33         ` Andy Lutomirski
  2019-06-07 16:58         ` Nadav Amit
  1 sibling, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2019-06-07 16:33 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, Nadav Amit, the arch/x86 maintainers, LKML,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira, Josh Poimboeuf



> On Jun 7, 2019, at 1:49 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> 
>> On Fri, 7 Jun 2019 at 10:29, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
>>>> + * Usage example:
>>>> + *
>>>> + *   # Start with the following functions (with identical prototypes):
>>>> + *   int func_a(int arg1, int arg2);
>>>> + *   int func_b(int arg1, int arg2);
>>>> + *
>>>> + *   # Define a 'my_key' reference, associated with func_a() by default
>>>> + *   DEFINE_STATIC_CALL(my_key, func_a);
>>>> + *
>>>> + *   # Call func_a()
>>>> + *   static_call(my_key, arg1, arg2);
>>>> + *
>>>> + *   # Update 'my_key' to point to func_b()
>>>> + *   static_call_update(my_key, func_b);
>>>> + *
>>>> + *   # Call func_b()
>>>> + *   static_call(my_key, arg1, arg2);
>>> 
>>> I think that this calling interface is not very intuitive.
>> 
>> Yeah, it is somewhat unfortunate..
>> 
> 
> Another thing I brought up at the time is that it would be useful to
> have the ability to 'reset' a static call to its default target. E.g.,
> for crypto modules that implement an accelerated version of a library
> interface, removing the module should revert those call sites back to
> the original target, without putting a disproportionate burden on the
> module itself to implement the logic to support this.

I was thinking this could be a layer on top.  We could have a way to register a static call with the module core so that, when a GPL module with an appropriate symbol is loaded, the static call gets replaced.

KVM could use this too.  Or we could just require KVM to be built in some day.
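
A very rough sketch of what that registration layer could look like
(entirely hypothetical -- nothing like this is in the series, the names are
made up, and I'm assuming a by-pointer helper such as __static_call_update()
exists next to the macro interface):

    struct static_call_override {
            struct static_call_key  *key;
            const char              *symbol;        /* provided by the module */
            void                    *dflt;          /* target to revert to    */
    };

    static struct static_call_override overrides[] = {
            /* { &my_key, "func_b_accelerated", func_a }, */
    };

    static int override_module_notify(struct notifier_block *nb,
                                      unsigned long val, void *data)
    {
            struct module *mod = data;
            struct static_call_override *o;

            for (o = overrides; o < overrides + ARRAY_SIZE(overrides); o++) {
                    void *addr;

                    switch (val) {
                    case MODULE_STATE_LIVE:
                            /* hand-waving the symbol lookup in @mod here */
                            addr = lookup_symbol_in_module(mod, o->symbol);
                            if (addr)
                                    __static_call_update(o->key, addr);
                            break;
                    case MODULE_STATE_GOING:
                            __static_call_update(o->key, o->dflt);
                            break;
                    }
            }
            return NOTIFY_OK;
    }

That would keep the revert logic in the core, out of the module itself.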

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-07  8:37     ` Peter Zijlstra
@ 2019-06-07 16:35       ` Nadav Amit
  2019-06-07 17:41         ` Peter Zijlstra
  2019-06-10 17:19       ` Josh Poimboeuf
  1 sibling, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-07 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 7, 2019, at 1:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Jun 06, 2019 at 10:24:17PM +0000, Nadav Amit wrote:
> 
>>> +static void static_call_del_module(struct module *mod)
>>> +{
>>> +	struct static_call_site *start = mod->static_call_sites;
>>> +	struct static_call_site *stop = mod->static_call_sites +
>>> +					mod->num_static_call_sites;
>>> +	struct static_call_site *site;
>>> +	struct static_call_key *key, *prev_key = NULL;
>>> +	struct static_call_mod *site_mod;
>>> +
>>> +	for (site = start; site < stop; site++) {
>>> +		key = static_call_key(site);
>>> +		if (key == prev_key)
>>> +			continue;
>>> +		prev_key = key;
>>> +
>>> +		list_for_each_entry(site_mod, &key->site_mods, list) {
>>> +			if (site_mod->mod == mod) {
>>> +				list_del(&site_mod->list);
>>> +				kfree(site_mod);
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>> 
>> I think that for safety, when a module is removed, all the static-calls
>> should be traversed to check that none of them calls any function in the
>> removed module. If that happens, perhaps it should be poisoned.
> 
> We don't do that for normal indirect calls either.. I suppose we could
> here, but meh.
> 
>>> +}
>>> +
>>> +static int static_call_module_notify(struct notifier_block *nb,
>>> +				     unsigned long val, void *data)
>>> +{
>>> +	struct module *mod = data;
>>> +	int ret = 0;
>>> +
>>> +	cpus_read_lock();
>>> +	static_call_lock();
>>> +
>>> +	switch (val) {
>>> +	case MODULE_STATE_COMING:
>>> +		module_disable_ro(mod);
>>> +		ret = static_call_add_module(mod);
>>> +		module_enable_ro(mod, false);
>> 
>> Doesn’t it cause some pages to be W+X ? Can it be avoided?
> 
> I don't know why it does this, jump_labels doesn't seem to need this,
> and I'm not seeing what static_call needs differently.
> 
>>> +		if (ret) {
>>> +			WARN(1, "Failed to allocate memory for static calls");
>>> +			static_call_del_module(mod);
>> 
>> If static_call_add_module() succeeded in changing some of the calls, but not
>> all, I don’t think that static_call_del_module() will correctly undo
>> static_call_add_module(). The code transformations, I think, will remain.
> 
> Hurm, jump_labels has the same problem.
> 
> I wonder why kernel/module.c:prepare_coming_module() doesn't propagate
> the error from the notifier call. If it were to do that, I think we'll
> abort the module load and any modifications get lost anyway.

This might be a security problem, since it can leave indirect branches,
which are susceptible to Spectre v2, in the code.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-07  8:49       ` Ard Biesheuvel
  2019-06-07 16:33         ` Andy Lutomirski
@ 2019-06-07 16:58         ` Nadav Amit
  1 sibling, 0 replies; 87+ messages in thread
From: Nadav Amit @ 2019-06-07 16:58 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

> On Jun 7, 2019, at 1:49 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> 
> On Fri, 7 Jun 2019 at 10:29, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
>>>> + * Usage example:
>>>> + *
>>>> + *   # Start with the following functions (with identical prototypes):
>>>> + *   int func_a(int arg1, int arg2);
>>>> + *   int func_b(int arg1, int arg2);
>>>> + *
>>>> + *   # Define a 'my_key' reference, associated with func_a() by default
>>>> + *   DEFINE_STATIC_CALL(my_key, func_a);
>>>> + *
>>>> + *   # Call func_a()
>>>> + *   static_call(my_key, arg1, arg2);
>>>> + *
>>>> + *   # Update 'my_key' to point to func_b()
>>>> + *   static_call_update(my_key, func_b);
>>>> + *
>>>> + *   # Call func_b()
>>>> + *   static_call(my_key, arg1, arg2);
>>> 
>>> I think that this calling interface is not very intuitive.
>> 
>> Yeah, it is somewhat unfortunate..
> 
> Another thing I brought up at the time is that it would be useful to
> have the ability to 'reset' a static call to its default target. E.g.,
> for crypto modules that implement an accelerated version of a library
> interface, removing the module should revert those call sites back to
> the original target, without putting a disproportionate burden on the
> module itself to implement the logic to support this.
> 
> 
>>> I understand that
>>> the macros/objtool cannot allow the calling interface to be completely
>>> transparent (as compiler plugin could). But, can the macros be used to
>>> paste the key with the “static_call”? I think that having something like:
>>> 
>>>  static_call__func(arg1, arg2)
>>> 
>>> Is more readable than
>>> 
>>>  static_call(func, arg1, arg2)
>> 
>> Doesn't really make it much better for me; I think I'd prefer to switch
>> to the GCC plugin scheme over this.  ISTR there being some propotypes
>> there, but I couldn't quickly locate them.
> 
> I implemented the GCC plugin here
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
> 
> but IIRC, all it does is annotate call sites exactly how objtool does it.

I did not see your version before I made mine for a similar (but slightly
different) purpose:

https://lore.kernel.org/lkml/20181231072112.21051-4-namit@vmware.com/

My version, I think, is more generic (I don’t think yours considers calls
that have a return value). Anyhow, I am sure you know more about GCC plugins
than I do.

I do have a version that can take annotations to say which call should be
static and to get the symbol it uses.

I also think that this implementation would disallow keys that reside within
structs. This would mean that paravirt, for instance, would need to go
through many changes to use this infrastructure.
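
To illustrate that limitation, using the my_key/func_a names from the usage
example in the patch (illustration only):

    /* Works: 'my_key' is a plain identifier, so the macros can paste the
     * key and trampoline symbol names from it. */
    DEFINE_STATIC_CALL(my_key, func_a);
    ret = static_call(my_key, arg1, arg2);

    /* Not expressible with this interface: a key embedded in a struct, the
     * way paravirt keeps its ops -- there is no identifier to derive a
     * trampoline name from. */
    struct my_ops {
            struct static_call_key  do_thing;       /* no trampoline to paste */
    };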


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 15:47   ` Masami Hiramatsu
@ 2019-06-07 17:34     ` Peter Zijlstra
  2019-06-07 17:48       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07 17:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:

> > This fits almost all text_poke_bp() users, except
> > arch_unoptimize_kprobe() which restores random text, and for that site
> > we have to build an explicit emulate instruction.
> 
> Hm, actually it doesn't restore random text, since the first byte
> must always be int3. As the function name implies, it just unoptimizes
> (jump-based optprobe -> int3-based kprobe).
> Anyway, that is not an issue. With this patch, optprobes must still work.

I thought it basically restored 5 bytes of original text (with no
guarantee it is a single instruction, or even a complete instruction),
with the first byte replaced with INT3.

> > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> >  	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> >  		return 0;
> >  
> > -	/* set up the specified breakpoint handler */
> > -	regs->ip = (unsigned long) bp_int3_handler;
> > +	opcode = *(struct opcode *)bp_int3_opcode;
> > +
> > +	switch (opcode.insn) {
> > +	case 0xE8: /* CALL */
> > +		int3_emulate_call(regs, ip + opcode.rel);
> > +		break;
> > +
> > +	case 0xE9: /* JMP */
> > +		int3_emulate_jmp(regs, ip + opcode.rel);
> > +		break;
> > +
> > +	default: /* assume NOP */
> 
> Shouldn't we check whether it is actually NOP here?

I was/am lazy and didn't want to deal with:

arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5

But maybe we should check for all the various NOP5 variants and BUG() on
anything unexpected.
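
Something like this, perhaps (sketch only; assumes the opcode struct from
the patch, and that comparing against the boot-time ideal NOP plus the
build-time default covers everything we'd ever write):

    default: /* assume NOP5 */
    {
            /* the build-time default, as jump_label.c spells it */
            static const u8 default_nop[] = { STATIC_KEY_INIT_NOP };

            if (memcmp(&opcode, ideal_nops[NOP_ATOMIC5], 5) &&
                memcmp(&opcode, default_nop, 5))
                    BUG();  /* not something we know how to emulate */

            int3_emulate_jmp(regs, ip);
            break;
    }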

> > +		int3_emulate_jmp(regs, ip);
> > +		break;
> > +	}
> 
> BTW, if we fix the length of patching always 5 bytes and allow user
> to apply it only from/to jump/call/nop, we may be better to remove
> "len" and rename it, something like "text_poke_branch" etc.

I considered it, but was thinking we could still allow patching other
instructions; we'd just have to extend the emulation in
poke_int3_handler().

Then again, if/when we want to do that, we can also restore the @len
argument.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-07 16:35       ` Nadav Amit
@ 2019-06-07 17:41         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-07 17:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, Jun 07, 2019 at 04:35:42PM +0000, Nadav Amit wrote:
> > On Jun 7, 2019, at 1:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Jun 06, 2019 at 10:24:17PM +0000, Nadav Amit wrote:

> >>> +		if (ret) {
> >>> +			WARN(1, "Failed to allocate memory for static calls");
> >>> +			static_call_del_module(mod);
> >> 
> >> If static_call_add_module() succeeded in changing some of the calls, but not
> >> all, I don’t think that static_call_del_module() will correctly undo
> >> static_call_add_module(). The code transformations, I think, will remain.
> > 
> > Hurm, jump_labels has the same problem.
> > 
> > I wonder why kernel/module.c:prepare_coming_module() doesn't propagate
> > the error from the notifier call. If it were to do that, I think we'll
> > abort the module load and any modifications get lost anyway.
> 
> This might be a security problem, since it can leave indirect branches,
> which are susceptible to Spectre v2, in the code.

It's a correctness problem too, for both jump_label and static_call:
if we don't patch the call site, we also don't patch the trampoline,
and who knows what random code it ends up running.

I'll go stare at the module code once my migraine goes away again :/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 17:34     ` Peter Zijlstra
@ 2019-06-07 17:48       ` Linus Torvalds
  2019-06-11 10:44         ` Peter Zijlstra
  2019-06-07 18:10       ` Andy Lutomirski
  2019-06-12 17:09       ` Peter Zijlstra
  2 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2019-06-07 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Fri, Jun 7, 2019 at 10:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I was/am lazy and didn't want to deal with:
>
> arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5

Ugh. Maybe we could just pick one atomic sequence, and not have the
magic atomic nops be dynamic.

It's essentially what STATIC_KEY_INIT_NOP #define seems to do anyway.

NOP5_ATOMIC is already special, and not used for the normal nop
rewriting, only for kprobe/jump_label/ftrace.

So I suspect we could just replace all cases of

   ideal_nops[NOP_ATOMIC5]

with

   STATIC_KEY_INIT_NOP

and get rid of the whole "let's optimize the atomic 5-byte nop" entirely.

Hmm?

By definition, NOP_ATOMIC5 is just a single nop anyway, it's not used
for the potentially more complex alternative rewriting cases.
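
One wrinkle: STATIC_KEY_INIT_NOP is a comma-separated byte list rather than
an array, so (IIRC, much as jump_label.c already does) callers would wrap it
first, something like:

    /* sketch only */
    static const unsigned char nop5[] = { STATIC_KEY_INIT_NOP };

    text_poke_bp(addr, nop5, sizeof(nop5), NULL);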

                Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 17:34     ` Peter Zijlstra
  2019-06-07 17:48       ` Linus Torvalds
@ 2019-06-07 18:10       ` Andy Lutomirski
  2019-06-07 20:22         ` hpa
  2019-06-11  8:03         ` Peter Zijlstra
  2019-06-12 17:09       ` Peter Zijlstra
  2 siblings, 2 replies; 87+ messages in thread
From: Andy Lutomirski @ 2019-06-07 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira



> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
> 
>>> This fits almost all text_poke_bp() users, except
>>> arch_unoptimize_kprobe() which restores random text, and for that site
>>> we have to build an explicit emulate instruction.
>> 
>> Hm, actually it doesn't restore random text, since the first byte
>> must always be int3. As the function name implies, it just unoptimizes
>> (jump-based optprobe -> int3-based kprobe).
>> Anyway, that is not an issue. With this patch, optprobes must still work.
> 
> I thought it basically restored 5 bytes of original text (with no
> guarantee it is a single instruction, or even a complete instruction),
> with the first byte replaced with INT3.
> 

I am surely missing some kprobe context, but is it really safe to use this mechanism to replace more than one instruction?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 05/15] x86_32: Provide consistent pt_regs
  2019-06-05 13:07 ` [PATCH 05/15] x86_32: Provide consistent pt_regs Peter Zijlstra
  2019-06-07 13:13   ` Masami Hiramatsu
@ 2019-06-07 19:32   ` Josh Poimboeuf
  2019-06-11  8:14     ` Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-07 19:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:07:58PM +0200, Peter Zijlstra wrote:
> Currently pt_regs on x86_32 has an oddity in that kernel regs
> (!user_mode(regs)) are short two entries (esp/ss). This means that any
> code trying to use them (typically: regs->sp) needs to jump through
> some unfortunate hoops.
> 
> Change the entry code to fix this up and create a full pt_regs frame.
> 
> This then simplifies various trampolines in ftrace and kprobes, the
> stack unwinder, ptrace, kdump and kgdb.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> ---
>  arch/x86/entry/entry_32.S         |  105 ++++++++++++++++++++++++++++++++++----
>  arch/x86/include/asm/kexec.h      |   17 ------
>  arch/x86/include/asm/ptrace.h     |   17 ------
>  arch/x86/include/asm/stacktrace.h |    2 
>  arch/x86/kernel/crash.c           |    8 --
>  arch/x86/kernel/ftrace_32.S       |   77 +++++++++++++++------------
>  arch/x86/kernel/kgdb.c            |    8 --
>  arch/x86/kernel/kprobes/common.h  |    4 -
>  arch/x86/kernel/kprobes/core.c    |   29 ++++------
>  arch/x86/kernel/kprobes/opt.c     |   20 ++++---
>  arch/x86/kernel/process_32.c      |   16 +----
>  arch/x86/kernel/ptrace.c          |   29 ----------
>  arch/x86/kernel/time.c            |    3 -
>  arch/x86/kernel/unwind_frame.c    |   32 +----------
>  arch/x86/kernel/unwind_orc.c      |    2 
>  15 files changed, 178 insertions(+), 191 deletions(-)

I recall writing some of this code (some of the kernel_stack_pointer
removal stuff) so please give me a shout-out ;-)

Otherwise:

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 18:10       ` Andy Lutomirski
@ 2019-06-07 20:22         ` hpa
  2019-06-11  8:03         ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: hpa @ 2019-06-07 20:22 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On June 7, 2019 11:10:19 AM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <peterz@infradead.org>
>wrote:
>> 
>> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>> 
>>>> This fits almost all text_poke_bp() users, except
>>>> arch_unoptimize_kprobe() which restores random text, and for that
>site
>>>> we have to build an explicit emulate instruction.
>>> 
>>> Hm, actually it doesn't restore random text, since the first byte
>>> must always be int3. As the function name implies, it just unoptimizes
>>> (jump-based optprobe -> int3-based kprobe).
>>> Anyway, that is not an issue. With this patch, optprobes must still
>work.
>> 
>> I thought it basically restored 5 bytes of original text (with no
>> guarantee it is a single instruction, or even a complete
>instruction),
>> with the first byte replaced with INT3.
>> 
>
>I am surely missing some kprobe context, but is it really safe to use
>this mechanism to replace more than one instruction?

I don't see how it could be, except *perhaps* inside an NMI handler, because you could have a preempted or interrupted task with an in-memory IP pointing into the middle of the region you are intending to patch.


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 07/15] x86: Add int3_emulate_call() selftest
  2019-06-05 13:08 ` [PATCH 07/15] x86: Add int3_emulate_call() selftest Peter Zijlstra
@ 2019-06-10 16:52   ` Josh Poimboeuf
  2019-06-10 16:57     ` Andy Lutomirski
  0 siblings, 1 reply; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:08:00PM +0200, Peter Zijlstra wrote:
> Given that the entry_*.S changes for this functionality are somewhat
> tricky, make sure the paths are tested every boot, instead of on the
> rare occasion when we trip an INT3 while rewriting text.
> 
> Requested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
  2019-06-07  5:41   ` Nadav Amit
  2019-06-07 15:47   ` Masami Hiramatsu
@ 2019-06-10 16:57   ` Josh Poimboeuf
  2019-06-11 15:14   ` Steven Rostedt
  3 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:08:01PM +0200, Peter Zijlstra wrote:
> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
> 
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
> 
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
> 
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.
> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Nadav Amit <namit@vmware.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 07/15] x86: Add int3_emulate_call() selftest
  2019-06-10 16:52   ` Josh Poimboeuf
@ 2019-06-10 16:57     ` Andy Lutomirski
  2019-06-11  8:17       ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2019-06-10 16:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, X86 ML, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 9:53 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Wed, Jun 05, 2019 at 03:08:00PM +0200, Peter Zijlstra wrote:
> > Given that the entry_*.S changes for this functionality are somewhat
> > tricky, make sure the paths are tested every boot, instead of on the
> > rare occasion when we trip an INT3 while rewriting text.
> >
> > Requested-by: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
>

Looks good to me, too, except that I seriously hate die notifiers that
return NOTIFY_STOP, and I eventually want to remove support for them.
This can wait, though.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-07  8:37     ` Peter Zijlstra
  2019-06-07 16:35       ` Nadav Amit
@ 2019-06-10 17:19       ` Josh Poimboeuf
  2019-06-10 18:33         ` Nadav Amit
  2019-10-01 12:00         ` Peter Zijlstra
  1 sibling, 2 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 10:37:56AM +0200, Peter Zijlstra wrote:
> > > +}
> > > +
> > > +static int static_call_module_notify(struct notifier_block *nb,
> > > +				     unsigned long val, void *data)
> > > +{
> > > +	struct module *mod = data;
> > > +	int ret = 0;
> > > +
> > > +	cpus_read_lock();
> > > +	static_call_lock();
> > > +
> > > +	switch (val) {
> > > +	case MODULE_STATE_COMING:
> > > +		module_disable_ro(mod);
> > > +		ret = static_call_add_module(mod);
> > > +		module_enable_ro(mod, false);
> > 
> > Doesn’t it cause some pages to be W+X ?

How so?

>> Can it be avoided?
> 
> I don't know why it does this, jump_labels doesn't seem to need this,
> and I'm not seeing what static_call needs differently.

I forgot why I did this, but it's probably for the case where there's a
static call site in module init code.  It deserves a comment.

Theoretically, jump labels need this too.

BTW, there's a change coming that will require the text_mutex before
calling module_{disable,enable}_ro().
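For illustration, the notifier path would then look roughly like this inside
static_call_module_notify() as quoted above (a sketch only -- the exact
locking order is an assumption, and static_call_unlock() is assumed to pair
with the static_call_lock() shown):

	cpus_read_lock();
	static_call_lock();
	mutex_lock(&text_mutex);

	switch (val) {
	case MODULE_STATE_COMING:
		/*
		 * Module init code may contain static call sites, so the
		 * module text must be writable while we patch it.
		 */
		module_disable_ro(mod);
		ret = static_call_add_module(mod);
		module_enable_ro(mod, false);
		break;
	}

	mutex_unlock(&text_mutex);
	static_call_unlock();
	cpus_read_unlock();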

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 14/15] static_call: Simple self-test module
  2019-06-05 13:08 ` [PATCH 14/15] static_call: Simple self-test module Peter Zijlstra
@ 2019-06-10 17:24   ` Josh Poimboeuf
  2019-06-11  8:29     ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 17:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:08:07PM +0200, Peter Zijlstra wrote:
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  lib/Kconfig.debug      |    8 ++++++++
>  lib/Makefile           |    1 +
>  lib/test_static_call.c |   41 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 50 insertions(+)
> 
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1955,6 +1955,14 @@ config TEST_STATIC_KEYS
>  
>  	  If unsure, say N.
>  
> +config TEST_STATIC_CALL
> +	tristate "Test static call"
> +	depends on m
> +	help
> +	  Test the static call interfaces.
> +
> +	  If unsure, say N.
> +

Any reason why we wouldn't just make this a built-in boot time test?

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-10 17:19       ` Josh Poimboeuf
@ 2019-06-10 18:33         ` Nadav Amit
  2019-06-10 18:42           ` Josh Poimboeuf
  2019-10-01 12:00         ` Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-10 18:33 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

> On Jun 10, 2019, at 10:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Fri, Jun 07, 2019 at 10:37:56AM +0200, Peter Zijlstra wrote:
>>>> +}
>>>> +
>>>> +static int static_call_module_notify(struct notifier_block *nb,
>>>> +				     unsigned long val, void *data)
>>>> +{
>>>> +	struct module *mod = data;
>>>> +	int ret = 0;
>>>> +
>>>> +	cpus_read_lock();
>>>> +	static_call_lock();
>>>> +
>>>> +	switch (val) {
>>>> +	case MODULE_STATE_COMING:
>>>> +		module_disable_ro(mod);
>>>> +		ret = static_call_add_module(mod);
>>>> +		module_enable_ro(mod, false);
>>> 
>>> Doesn’t it cause some pages to be W+X ?
> 
> How so?
> 
>>> Can it be avoided?
>> 
>> I don't know why it does this, jump_labels doesn't seem to need this,
>> and I'm not seeing what static_call needs differently.
> 
> I forgot why I did this, but it's probably for the case where there's a
> static call site in module init code.  It deserves a comment.
> 
> Theoretically, jump labels need this too.
> 
> BTW, there's a change coming that will require the text_mutex before
> calling module_{disable,enable}_ro().

I think that, eventually, the most secure flow is for the module executable
to be write-protected immediately after the module signature is checked, and
then to use text_poke() to change the code without ever removing the
write-protection.

Ideally, these pieces of code (module signature check and static-key/call
mechanisms) would somehow be isolated.

I wonder whether static-calls in init-code cannot just be avoided. They
would most likely introduce more overhead in patching than they would save
in execution time.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-05 13:08 ` [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64 Peter Zijlstra
  2019-06-07  5:50   ` Nadav Amit
@ 2019-06-10 18:33   ` Josh Poimboeuf
  2019-06-10 18:45     ` Nadav Amit
  2019-10-01 14:43     ` Peter Zijlstra
  1 sibling, 2 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 05, 2019 at 03:08:06PM +0200, Peter Zijlstra wrote:
> --- a/arch/x86/include/asm/static_call.h
> +++ b/arch/x86/include/asm/static_call.h
> @@ -2,6 +2,20 @@
>  #ifndef _ASM_STATIC_CALL_H
>  #define _ASM_STATIC_CALL_H
>  
> +#include <asm/asm-offsets.h>
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +
> +/*
> + * This trampoline is only used during boot / module init, so it's safe to use
> + * the indirect branch without a retpoline.
> + */
> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> +	ANNOTATE_RETPOLINE_SAFE						\
> +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
> +
> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */

I wonder if we can simplify this (and drop the indirect branch) by
getting rid of the above cruft, and instead just use the out-of-line
trampoline as the default for inline as well.

Then the inline case could fall back to the out-of-line implementation
(by patching the trampoline's jmp dest) before static_call_initialized
is set.
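
Roughly, the idea in pseudo-code (a sketch only; the helper and field names
below are assumptions, not the actual series):

	void __static_call_update(struct static_call_key *key, void *func)
	{
		if (!static_call_initialized) {
			/*
			 * The inline call sites are not patched yet; every
			 * caller still goes through the out-of-line
			 * trampoline, so retargeting its direct JMP is
			 * enough for early callers to reach @func.
			 */
			arch_static_call_transform(key->tramp, func);
			return;
		}

		/* ... otherwise walk and patch each inline call site ... */
	}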

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-10 18:33         ` Nadav Amit
@ 2019-06-10 18:42           ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 18:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 06:33:26PM +0000, Nadav Amit wrote:
> > On Jun 10, 2019, at 10:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Fri, Jun 07, 2019 at 10:37:56AM +0200, Peter Zijlstra wrote:
> >>>> +}
> >>>> +
> >>>> +static int static_call_module_notify(struct notifier_block *nb,
> >>>> +				     unsigned long val, void *data)
> >>>> +{
> >>>> +	struct module *mod = data;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	cpus_read_lock();
> >>>> +	static_call_lock();
> >>>> +
> >>>> +	switch (val) {
> >>>> +	case MODULE_STATE_COMING:
> >>>> +		module_disable_ro(mod);
> >>>> +		ret = static_call_add_module(mod);
> >>>> +		module_enable_ro(mod, false);
> >>> 
> >>> Doesn’t it cause some pages to be W+X ?
> > 
> > How so?
> > 
> >>> Can it be avoided?
> >> 
> >> I don't know why it does this, jump_labels doesn't seem to need this,
> >> and I'm not seeing what static_call needs differently.
> > 
> > I forgot why I did this, but it's probably for the case where there's a
> > static call site in module init code.  It deserves a comment.
> > 
> > Theoretically, jump labels need this too.
> > 
> > BTW, there's a change coming that will require the text_mutex before
> > calling module_{disable,enable}_ro().
> 
> I think that, eventually, the most secure flow is for the module executable
> to be write-protected immediately after the module signature is checked, and
> then to use text_poke() to change the code without ever removing the
> write-protection.
> 
> Ideally, these pieces of code (module signature check and static-key/call
> mechanisms) would somehow be isolated.
> 
> I wonder whether static-calls in init-code cannot just be avoided. They
> would most likely introduce more overhead in patching than they would save
> in execution time.

It's a valid question.  Are any tracepoints called from module init?  Or
-- thinking ahead -- are there any pv calls from module init?  That
might be plausible.

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-10 18:33   ` Josh Poimboeuf
@ 2019-06-10 18:45     ` Nadav Amit
  2019-06-10 18:55       ` Josh Poimboeuf
  2019-10-01 14:43     ` Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-10 18:45 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

> On Jun 10, 2019, at 11:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Wed, Jun 05, 2019 at 03:08:06PM +0200, Peter Zijlstra wrote:
>> --- a/arch/x86/include/asm/static_call.h
>> +++ b/arch/x86/include/asm/static_call.h
>> @@ -2,6 +2,20 @@
>> #ifndef _ASM_STATIC_CALL_H
>> #define _ASM_STATIC_CALL_H
>> 
>> +#include <asm/asm-offsets.h>
>> +
>> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
>> +
>> +/*
>> + * This trampoline is only used during boot / module init, so it's safe to use
>> + * the indirect branch without a retpoline.
>> + */
>> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
>> +	ANNOTATE_RETPOLINE_SAFE						\
>> +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
>> +
>> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> 
> I wonder if we can simplify this (and drop the indirect branch) by
> getting rid of the above cruft, and instead just use the out-of-line
> trampoline as the default for inline as well.
> 
> Then the inline case could fall back to the out-of-line implementation
> (by patching the trampoline's jmp dest) before static_call_initialized
> is set.

I must be missing some context - but what guarantees that this indirect
branch would be exactly 5 bytes long? Isn’t there an assumption that this
would be the case? Shouldn’t there be some handling of the padding?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-10 18:45     ` Nadav Amit
@ 2019-06-10 18:55       ` Josh Poimboeuf
  2019-06-10 19:20         ` Nadav Amit
  0 siblings, 1 reply; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-10 18:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 06:45:52PM +0000, Nadav Amit wrote:
> > On Jun 10, 2019, at 11:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Wed, Jun 05, 2019 at 03:08:06PM +0200, Peter Zijlstra wrote:
> >> --- a/arch/x86/include/asm/static_call.h
> >> +++ b/arch/x86/include/asm/static_call.h
> >> @@ -2,6 +2,20 @@
> >> #ifndef _ASM_STATIC_CALL_H
> >> #define _ASM_STATIC_CALL_H
> >> 
> >> +#include <asm/asm-offsets.h>
> >> +
> >> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> >> +
> >> +/*
> >> + * This trampoline is only used during boot / module init, so it's safe to use
> >> + * the indirect branch without a retpoline.
> >> + */
> >> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> >> +	ANNOTATE_RETPOLINE_SAFE						\
> >> +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
> >> +
> >> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > 
> > I wonder if we can simplify this (and drop the indirect branch) by
> > getting rid of the above cruft, and instead just use the out-of-line
> > trampoline as the default for inline as well.
> > 
> > Then the inline case could fall back to the out-of-line implementation
> > (by patching the trampoline's jmp dest) before static_call_initialized
> > is set.
> 
> I must be missing some context - but what guarantees that this indirect
> branch would be exactly 5 bytes long? Isn’t there an assumption that this
> would be the case? Shouldn’t there be some handling of the padding?

We don't patch the indirect branch.  It's just part of a temporary
trampoline which is called by the call site, and which does "jmp
key->func" during boot until static call initialization is done.

(Though I'm suggesting removing that.)

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-10 18:55       ` Josh Poimboeuf
@ 2019-06-10 19:20         ` Nadav Amit
  0 siblings, 0 replies; 87+ messages in thread
From: Nadav Amit @ 2019-06-10 19:20 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

> On Jun 10, 2019, at 11:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Mon, Jun 10, 2019 at 06:45:52PM +0000, Nadav Amit wrote:
>>> On Jun 10, 2019, at 11:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> 
>>> On Wed, Jun 05, 2019 at 03:08:06PM +0200, Peter Zijlstra wrote:
>>>> --- a/arch/x86/include/asm/static_call.h
>>>> +++ b/arch/x86/include/asm/static_call.h
>>>> @@ -2,6 +2,20 @@
>>>> #ifndef _ASM_STATIC_CALL_H
>>>> #define _ASM_STATIC_CALL_H
>>>> 
>>>> +#include <asm/asm-offsets.h>
>>>> +
>>>> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
>>>> +
>>>> +/*
>>>> + * This trampoline is only used during boot / module init, so it's safe to use
>>>> + * the indirect branch without a retpoline.
>>>> + */
>>>> +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
>>>> +	ANNOTATE_RETPOLINE_SAFE						\
>>>> +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
>>>> +
>>>> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
>>> 
>>> I wonder if we can simplify this (and drop the indirect branch) by
>>> getting rid of the above cruft, and instead just use the out-of-line
>>> trampoline as the default for inline as well.
>>> 
>>> Then the inline case could fall back to the out-of-line implementation
>>> (by patching the trampoline's jmp dest) before static_call_initialized
>>> is set.
>> 
>> I must be missing some context - but what guarantees that this indirect
>> branch would be exactly 5 bytes long? Isn’t there an assumption that this
>> would be the case? Shouldn’t there be some handling of the padding?
> 
> We don't patch the indirect branch.  It's just part of a temporary
> trampoline which is called by the call site, and which does "jmp
> key->func" during boot until static call initialization is done.
> 
> (Though I'm suggesting removing that.)

Oh... I see.

On another note - even if this branch is only executed during module
initialization, it does seem safer to use a retpoline instead of an indirect
branch (consider a branch that is run many times on one hardware thread on
SMT, when STIBP is not set, and attacker code is run on the second thread).

I guess you can't simply call the retpoline code, since you don't have a
register you can clobber to hold the target. But it still seems possible to
use a retpoline. Anyhow, it might be a moot discussion if this code is removed.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 18:10       ` Andy Lutomirski
  2019-06-07 20:22         ` hpa
@ 2019-06-11  8:03         ` Peter Zijlstra
  2019-06-11 12:08           ` Peter Zijlstra
                             ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11  8:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
> > 
> >>> This fits almost all text_poke_bp() users, except
> >>> arch_unoptimize_kprobe() which restores random text, and for that site
> >>> we have to build an explicit emulate instruction.
> >> 
> >> Hm, actually it doesn't restore random text, since the first byte
> >> must always be int3. As the function name means, it just unoptimizes
> >> (jump based optprobe -> int3 based kprobe).
> >> Anyway, that is not an issue. With this patch, optprobe must still work.
> > 
> > I thought it basically restored 5 bytes of original text (with no
> > guarantee it is a single instruction, or even a complete instruction),
> > with the first byte replaced with INT3.
> > 
> 
> I am surely missing some kprobe context, but is it really safe to use
> this mechanism to replace more than one instruction?

I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
wrong.

So what happens is that arch_prepare_optimized_kprobe() <-
copy_optimized_instructions() copies however much of the instruction
stream is required such that we can overwrite the instruction at @addr
with a 5 byte jump.

arch_optimize_kprobe() then does the text_poke_bp() that replaces the
instruction @addr with int3, copies the rel jump address and overwrites
the int3 with jmp.

And I'm thinking the problem is with something like:

@addr: nop nop nop nop nop

We copy out the nops into the trampoline, overwrite the first nop with
an INT3, overwrite the remaining nops with the rel addr, but oops,
another CPU can still be executing one of those NOPs, right?

I'm thinking we could fix this by first writing INT3 into all relevant
instructions, which is going to be messy, given the current code base.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/15] x86/kprobes: Fix frame pointer annotations
  2019-06-07 13:36     ` Josh Poimboeuf
  2019-06-07 15:21       ` Masami Hiramatsu
@ 2019-06-11  8:12       ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11  8:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 09:36:02AM -0400, Josh Poimboeuf wrote:
> On Fri, Jun 07, 2019 at 10:02:10PM +0900, Masami Hiramatsu wrote:
> > On Wed, 05 Jun 2019 15:07:56 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > The kprobe trampolines have a FRAME_POINTER annotation that makes no
> > > sense. It marks the frame in the middle of pt_regs, at the place of
> > > saving BP.
> > 
> > commit ee213fc72fd67 introduced this code, and this is for unwinder which
> > uses frame pointer. I think current code stores the address of previous
> > (original context's) frame pointer into %rbp. So with that, if unwinder
> > tries to decode frame pointer, it can get the original %rbp value,
> > instead of &pt_regs from current %rbp.

The way I read that code is that we'll put the value of SP into BP at
the point where we've done 'PUSH BP', which is right in the middle of
that PUSH sequence. So while it works for a FP based unwinder, it
doesn't 'properly' identify the current frame.

> > > Change it to mark the pt_regs frame as per the ENCODE_FRAME_POINTER
> > > from the respective entry_*.S.
> > > 
> > 
> > With this change, I think stack unwinder can not get the original %rbp
> > value. Peter, could you check the above commit?
> 
> The unwinder knows how to decode the encoded frame pointer.  So it can
> find regs by decoding the new rbp value, and it also knows that regs->bp
> is the original rbp value.

Right, as Josh says, the unwinder has a special case for this and it
knows that these 'odd' BP values (either MSB or LSB set) indicate a
pt_regs frame.
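
For reference, a simplified sketch of that decode step, modelled on the
frame pointer unwinder in arch/x86/kernel/unwind_frame.c (treat the exact
bit choice here as approximate):

	#ifdef CONFIG_X86_64
	# define FP_REGS_BIT	0x1UL		/* LSB, set by ENCODE_FRAME_POINTER */
	#else
	# define FP_REGS_BIT	0x80000000UL	/* MSB on 32-bit */
	#endif

	static struct pt_regs *decode_frame_pointer(unsigned long *bp)
	{
		unsigned long regs = (unsigned long)bp;

		if (!(regs & FP_REGS_BIT))
			return NULL;		/* ordinary frame pointer */

		return (struct pt_regs *)(regs & ~FP_REGS_BIT);
	}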

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 05/15] x86_32: Provide consistent pt_regs
  2019-06-07 19:32   ` Josh Poimboeuf
@ 2019-06-11  8:14     ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11  8:14 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 03:32:15PM -0400, Josh Poimboeuf wrote:
> On Wed, Jun 05, 2019 at 03:07:58PM +0200, Peter Zijlstra wrote:
> > Currently pt_regs on x86_32 has an oddity in that kernel regs
> > (!user_mode(regs)) are short two entries (esp/ss). This means that any
> > code trying to use them (typically: regs->sp) needs to jump through
> > some unfortunate hoops.
> > 
> > Change the entry code to fix this up and create a full pt_regs frame.
> > 
> > This then simplifies various trampolines in ftrace and kprobes, the
> > stack unwinder, ptrace, kdump and kgdb.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >
> > ---
> >  arch/x86/entry/entry_32.S         |  105 ++++++++++++++++++++++++++++++++++----
> >  arch/x86/include/asm/kexec.h      |   17 ------
> >  arch/x86/include/asm/ptrace.h     |   17 ------
> >  arch/x86/include/asm/stacktrace.h |    2 
> >  arch/x86/kernel/crash.c           |    8 --
> >  arch/x86/kernel/ftrace_32.S       |   77 +++++++++++++++------------
> >  arch/x86/kernel/kgdb.c            |    8 --
> >  arch/x86/kernel/kprobes/common.h  |    4 -
> >  arch/x86/kernel/kprobes/core.c    |   29 ++++------
> >  arch/x86/kernel/kprobes/opt.c     |   20 ++++---
> >  arch/x86/kernel/process_32.c      |   16 +----
> >  arch/x86/kernel/ptrace.c          |   29 ----------
> >  arch/x86/kernel/time.c            |    3 -
> >  arch/x86/kernel/unwind_frame.c    |   32 +----------
> >  arch/x86/kernel/unwind_orc.c      |    2 
> >  15 files changed, 178 insertions(+), 191 deletions(-)
> 
> I recall writing some of this code (some of the kernel_stack_pointer
> removal stuff) so please give me a shout-out ;-)

Absolutely, sorry for not doing that. I've added:

 "Much thanks to Josh for help with the cleanups!"

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 07/15] x86: Add int3_emulate_call() selftest
  2019-06-10 16:57     ` Andy Lutomirski
@ 2019-06-11  8:17       ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11  8:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, X86 ML, LKML, Ard Biesheuvel, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Linus Torvalds, Masami Hiramatsu,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 09:57:58AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 9:53 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > On Wed, Jun 05, 2019 at 03:08:00PM +0200, Peter Zijlstra wrote:
> > > Given that the entry_*.S changes for this functionality are somewhat
> > > tricky, make sure the paths are tested every boot, instead of on the
> > > rare occasion when we trip an INT3 while rewriting text.
> > >
> > > Requested-by: Andy Lutomirski <luto@kernel.org>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >
> > Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
> >
> 
> Looks good to me, too,

I'll translate that into an Acked-by from you, if you don't mind :-)

> except that I seriously hate die notifiers that
> return NOTIFY_STOP, and I eventually want to remove support for them.
> This can wait, though.

Yes, I share your hatred for notifiers in general. But since they are
still here and I do think it is a waste to have an unconditional
function call in do_int3() just for this, I figured I'd use them.
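
Roughly, the selftest's use of the notifier looks like this (a sketch;
int3_magic() is a stand-in name for the test target and the priority value
is illustrative):

	static int int3_exception_notify(struct notifier_block *self,
					 unsigned long val, void *data)
	{
		struct die_args *args = data;

		if (!args->regs || user_mode(args->regs))
			return NOTIFY_DONE;

		if (val != DIE_INT3)
			return NOTIFY_DONE;

		/* emulate a CALL to int3_magic() and resume */
		int3_emulate_call(args->regs, (unsigned long)&int3_magic);
		return NOTIFY_STOP;
	}

	static struct notifier_block int3_exception_nb = {
		.notifier_call	= int3_exception_notify,
		.priority	= INT_MAX-1,	/* last */
	};

	/* during the boot-time selftest: */
	register_die_notifier(&int3_exception_nb);
	/* ... poke an INT3 over a call site and execute through it ... */
	unregister_die_notifier(&int3_exception_nb);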

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 14/15] static_call: Simple self-test module
  2019-06-10 17:24   ` Josh Poimboeuf
@ 2019-06-11  8:29     ` Peter Zijlstra
  2019-06-11 13:02       ` Josh Poimboeuf
  0 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11  8:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 12:24:28PM -0500, Josh Poimboeuf wrote:
> On Wed, Jun 05, 2019 at 03:08:07PM +0200, Peter Zijlstra wrote:
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  lib/Kconfig.debug      |    8 ++++++++
> >  lib/Makefile           |    1 +
> >  lib/test_static_call.c |   41 +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 50 insertions(+)
> > 
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1955,6 +1955,14 @@ config TEST_STATIC_KEYS
> >  
> >  	  If unsure, say N.
> >  
> > +config TEST_STATIC_CALL
> > +	tristate "Test static call"
> > +	depends on m
> > +	help
> > +	  Test the static call interfaces.
> > +
> > +	  If unsure, say N.
> > +
> 
> Any reason why we wouldn't just make this a built-in boot time test?

None whatsoever; but I did copy-paste from the static_key stuff and
that has it for some reason.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 17:48       ` Linus Torvalds
@ 2019-06-11 10:44         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 10:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, the arch/x86 maintainers,
	Linux List Kernel Mailing, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 10:48:06AM -0700, Linus Torvalds wrote:
> On Fri, Jun 7, 2019 at 10:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > I was/am lazy and didn't want to deal with:
> >
> > arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> > arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> > arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> > arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5
> 
> Ugh. Maybe we could just pick one atomic sequence, and not have the
> magic atomic nops be dynamic.

That'd be nice..

> It's essentially what STATIC_KEY_INIT_NOP #define seems to do anyway.

Well, that picks something, we'll overwrite it with the ideal nop later,
once we've figured out what it should be.

> NOP5_ATOMIC is already special, and not used for the normal nop
> rewriting, only for kprobe/jump_label/ftrace.

Right, but esp ftrace means there's a _lot_ of them around.

> So I suspect we could just replace all cases of
> 
>    ideal_nops[NOP_ATOMIC5]
> 
> with
> 
>    STATIC_KEY_INIT_NOP
> 
> and get rid of the whole "let's optimize the atomic 5-byte nop" entirely.
> 
> Hmm?

So we have:

GENERIC (x86_32):	leal ds:0x00(,%esi,1),%esi
K8 (x86_64):		osp osp osp osp nop
K7 (x86_32):		leal ds:0x00(,%eax,1),%eax
P6 (x86_64):		nopl 0x00(%eax,%eax,1)

And I guess the $64k question is if there's any actual performance
difference between the k8 and p6 variants on chips we still care about.

Most modern chips seem to end up selecting p6.

Anyway, the proposed patch looks like so:

---
Subject: x86: Remove ideal_nops[NOP_ATOMIC5]

By picking a single/fixed NOP5_ATOMIC instruction things become simpler.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/jump_label.h | 12 +++---------
 arch/x86/include/asm/nops.h       | 17 ++++++++---------
 arch/x86/kernel/alternative.c     |  8 --------
 arch/x86/kernel/ftrace.c          |  3 ++-
 arch/x86/kernel/jump_label.c      | 37 ++++++-------------------------------
 arch/x86/kernel/kprobes/core.c    |  3 ++-
 6 files changed, 21 insertions(+), 59 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..a3d45abcda95 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -4,12 +4,6 @@
 
 #define JUMP_LABEL_NOP_SIZE 5
 
-#ifdef CONFIG_X86_64
-# define STATIC_KEY_INIT_NOP P6_NOP5_ATOMIC
-#else
-# define STATIC_KEY_INIT_NOP GENERIC_NOP5_ATOMIC
-#endif
-
 #include <asm/asm.h>
 #include <asm/nops.h>
 
@@ -21,7 +15,7 @@
 static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
 {
 	asm_volatile_goto("1:"
-		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
+		".byte " __stringify(NOP5_ATOMIC) "\n\t"
 		".pushsection __jump_table,  \"aw\" \n\t"
 		_ASM_ALIGN "\n\t"
 		".long 1b - ., %l[l_yes] - . \n\t"
@@ -61,7 +55,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
 	.long		\target - .Lstatic_jump_after_\@
 .Lstatic_jump_after_\@:
 	.else
-	.byte		STATIC_KEY_INIT_NOP
+	.byte		NOP5_ATOMIC
 	.endif
 	.pushsection __jump_table, "aw"
 	_ASM_ALIGN
@@ -73,7 +67,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
 .macro STATIC_JUMP_IF_FALSE target, key, def
 .Lstatic_jump_\@:
 	.if \def
-	.byte		STATIC_KEY_INIT_NOP
+	.byte		NOP5_ATOMIC
 	.else
 	/* Equivalent to "jmp.d32 \target" */
 	.byte		0xe9
diff --git a/arch/x86/include/asm/nops.h b/arch/x86/include/asm/nops.h
index 12f12b5cf2ca..14cf05e645f5 100644
--- a/arch/x86/include/asm/nops.h
+++ b/arch/x86/include/asm/nops.h
@@ -28,7 +28,6 @@
 #define GENERIC_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00
 #define GENERIC_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00
 #define GENERIC_NOP8 GENERIC_NOP1,GENERIC_NOP7
-#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
 
 /* Opteron 64bit nops
    1: nop
@@ -44,7 +43,6 @@
 #define K8_NOP6	K8_NOP3,K8_NOP3
 #define K8_NOP7	K8_NOP4,K8_NOP3
 #define K8_NOP8	K8_NOP4,K8_NOP4
-#define K8_NOP5_ATOMIC 0x66,K8_NOP4
 
 /* K7 nops
    uses eax dependencies (arbitrary choice)
@@ -63,7 +61,6 @@
 #define K7_NOP6	0x8d,0x80,0,0,0,0
 #define K7_NOP7	0x8D,0x04,0x05,0,0,0,0
 #define K7_NOP8	K7_NOP7,K7_NOP1
-#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
 
 /* P6 nops
    uses eax dependencies (Intel-recommended choice)
@@ -86,7 +83,12 @@
 #define P6_NOP6	0x66,0x0f,0x1f,0x44,0x00,0
 #define P6_NOP7	0x0f,0x1f,0x80,0,0,0,0
 #define P6_NOP8	0x0f,0x1f,0x84,0x00,0,0,0,0
-#define P6_NOP5_ATOMIC P6_NOP5
+
+#ifdef CONFIG_X86_64
+#define NOP5_ATOMIC	P6_NOP5
+#else
+#define NOP5_ATOMIC	NOP_DS_PREFIX,GENERIC_NOP4
+#endif
 
 #ifdef __ASSEMBLY__
 #define _ASM_MK_NOP(x) .byte x
@@ -103,7 +105,6 @@
 #define ASM_NOP6 _ASM_MK_NOP(K7_NOP6)
 #define ASM_NOP7 _ASM_MK_NOP(K7_NOP7)
 #define ASM_NOP8 _ASM_MK_NOP(K7_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K7_NOP5_ATOMIC)
 #elif defined(CONFIG_X86_P6_NOP)
 #define ASM_NOP1 _ASM_MK_NOP(P6_NOP1)
 #define ASM_NOP2 _ASM_MK_NOP(P6_NOP2)
@@ -113,7 +114,6 @@
 #define ASM_NOP6 _ASM_MK_NOP(P6_NOP6)
 #define ASM_NOP7 _ASM_MK_NOP(P6_NOP7)
 #define ASM_NOP8 _ASM_MK_NOP(P6_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(P6_NOP5_ATOMIC)
 #elif defined(CONFIG_X86_64)
 #define ASM_NOP1 _ASM_MK_NOP(K8_NOP1)
 #define ASM_NOP2 _ASM_MK_NOP(K8_NOP2)
@@ -123,7 +123,6 @@
 #define ASM_NOP6 _ASM_MK_NOP(K8_NOP6)
 #define ASM_NOP7 _ASM_MK_NOP(K8_NOP7)
 #define ASM_NOP8 _ASM_MK_NOP(K8_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K8_NOP5_ATOMIC)
 #else
 #define ASM_NOP1 _ASM_MK_NOP(GENERIC_NOP1)
 #define ASM_NOP2 _ASM_MK_NOP(GENERIC_NOP2)
@@ -133,11 +132,11 @@
 #define ASM_NOP6 _ASM_MK_NOP(GENERIC_NOP6)
 #define ASM_NOP7 _ASM_MK_NOP(GENERIC_NOP7)
 #define ASM_NOP8 _ASM_MK_NOP(GENERIC_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(GENERIC_NOP5_ATOMIC)
 #endif
 
+#define ASM_NOP5_ATOMIC _ASM_MK_NOP(NOP5_ATOMIC)
+
 #define ASM_NOP_MAX 8
-#define NOP_ATOMIC5 (ASM_NOP_MAX+1)	/* Entry for the 5-byte atomic NOP */
 
 #ifndef __ASSEMBLY__
 extern const unsigned char * const *ideal_nops;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..4c0250049d4f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -90,7 +90,6 @@ static const unsigned char intelnops[] =
 	GENERIC_NOP6,
 	GENERIC_NOP7,
 	GENERIC_NOP8,
-	GENERIC_NOP5_ATOMIC
 };
 static const unsigned char * const intel_nops[ASM_NOP_MAX+2] =
 {
@@ -103,7 +102,6 @@ static const unsigned char * const intel_nops[ASM_NOP_MAX+2] =
 	intelnops + 1 + 2 + 3 + 4 + 5,
 	intelnops + 1 + 2 + 3 + 4 + 5 + 6,
 	intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-	intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
 };
 #endif
 
@@ -118,7 +116,6 @@ static const unsigned char k8nops[] =
 	K8_NOP6,
 	K8_NOP7,
 	K8_NOP8,
-	K8_NOP5_ATOMIC
 };
 static const unsigned char * const k8_nops[ASM_NOP_MAX+2] =
 {
@@ -131,7 +128,6 @@ static const unsigned char * const k8_nops[ASM_NOP_MAX+2] =
 	k8nops + 1 + 2 + 3 + 4 + 5,
 	k8nops + 1 + 2 + 3 + 4 + 5 + 6,
 	k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-	k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
 };
 #endif
 
@@ -146,7 +142,6 @@ static const unsigned char k7nops[] =
 	K7_NOP6,
 	K7_NOP7,
 	K7_NOP8,
-	K7_NOP5_ATOMIC
 };
 static const unsigned char * const k7_nops[ASM_NOP_MAX+2] =
 {
@@ -159,7 +154,6 @@ static const unsigned char * const k7_nops[ASM_NOP_MAX+2] =
 	k7nops + 1 + 2 + 3 + 4 + 5,
 	k7nops + 1 + 2 + 3 + 4 + 5 + 6,
 	k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-	k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
 };
 #endif
 
@@ -174,7 +168,6 @@ static const unsigned char p6nops[] =
 	P6_NOP6,
 	P6_NOP7,
 	P6_NOP8,
-	P6_NOP5_ATOMIC
 };
 static const unsigned char * const p6_nops[ASM_NOP_MAX+2] =
 {
@@ -187,7 +180,6 @@ static const unsigned char * const p6_nops[ASM_NOP_MAX+2] =
 	p6nops + 1 + 2 + 3 + 4 + 5,
 	p6nops + 1 + 2 + 3 + 4 + 5 + 6,
 	p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-	p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
 };
 #endif
 
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0927bb158ffc..6ea5ea506a5f 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -101,7 +101,8 @@ static unsigned long text_ip_addr(unsigned long ip)
 
 static const unsigned char *ftrace_nop_replace(void)
 {
-	return ideal_nops[NOP_ATOMIC5];
+	static const unsigned char nop[] = { NOP5_ATOMIC };
+	return nop;
 }
 
 static int
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e631c358f7f4..5a27cf6e1c73 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -40,8 +40,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 					 int init)
 {
 	union jump_code_union jmp;
-	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
-	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+	const unsigned char nop[] = { NOP5_ATOMIC };
 	const void *expect, *code;
 	int line;
 
@@ -50,21 +49,13 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 
 	if (type == JUMP_LABEL_JMP) {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = ideal_nop; line = __LINE__;
-		}
-
+		expect = nop;
+		line = __LINE__;
 		code = &jmp.code;
 	} else {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = &jmp.code; line = __LINE__;
-		}
-
-		code = ideal_nop;
+		expect = &jmp.code;
+		line = __LINE__;
+		code = nop;
 	}
 
 	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
@@ -108,22 +99,6 @@ static enum {
 __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry,
 				      enum jump_label_type type)
 {
-	/*
-	 * This function is called at boot up and when modules are
-	 * first loaded. Check if the default nop, the one that is
-	 * inserted at compile time, is the ideal nop. If it is, then
-	 * we do not need to update the nop, and we can leave it as is.
-	 * If it is not, then we need to update the nop to the ideal nop.
-	 */
-	if (jlstate == JL_STATE_START) {
-		const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
-		const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-
-		if (memcmp(ideal_nop, default_nop, 5) != 0)
-			jlstate = JL_STATE_UPDATE;
-		else
-			jlstate = JL_STATE_NO_UPDATE;
-	}
 	if (jlstate == JL_STATE_UPDATE)
 		__jump_label_transform(entry, type, 1);
 }
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 6afd8061dbae..5b9aa5608d0d 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -204,6 +204,7 @@ int can_boost(struct insn *insn, void *addr)
 static unsigned long
 __recover_probed_insn(kprobe_opcode_t *buf, unsigned long addr)
 {
+	static const unsigned char nop[] = { NOP5_ATOMIC };
 	struct kprobe *kp;
 	unsigned long faddr;
 
@@ -247,7 +248,7 @@ __recover_probed_insn(kprobe_opcode_t *buf, unsigned long addr)
 		return 0UL;
 
 	if (faddr)
-		memcpy(buf, ideal_nops[NOP_ATOMIC5], 5);
+		memcpy(buf, nop, 5);
 	else
 		buf[0] = kp->opcode;
 	return (unsigned long)buf;

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11  8:03         ` Peter Zijlstra
@ 2019-06-11 12:08           ` Peter Zijlstra
  2019-06-11 12:34             ` Peter Zijlstra
  2019-06-11 15:22           ` Steven Rostedt
  2019-06-11 15:54           ` Andy Lutomirski
  2 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 12:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 10:03:07AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:

> > I am surely missing some kprobe context, but is it really safe to use
> > this mechanism to replace more than one instruction?
> 
> I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> wrong.
> 
> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
> 
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
> 
> And I'm thinking the problem is with something like:
> 
> @addr: nop nop nop nop nop
> 
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
> 
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.

Maybe not that bad; how's something like this?

(completely untested)

---
 arch/x86/kernel/alternative.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..8f643dabea72 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -24,6 +24,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <asm/insn.h>
 
 int __read_mostly alternatives_patched;
 
@@ -849,6 +850,7 @@ static void do_sync_core(void *info)
 
 static bool bp_patching_in_progress;
 static void *bp_int3_handler, *bp_int3_addr;
+static unsigned int bp_int3_length;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
@@ -867,7 +869,11 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
+		return 0;
+
+	if (regs->ip < (unsigned long)bp_int3_addr ||
+	    regs->ip >= (unsigned long)bp_int3_addr + bp_int3_length)
 		return 0;
 
 	/* set up the specified breakpoint handler */
@@ -900,9 +906,12 @@ NOKPROBE_SYMBOL(poke_int3_handler);
 void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
 	unsigned char int3 = 0xcc;
+	void *kaddr = addr;
+	struct insn insn;
 
 	bp_int3_handler = handler;
 	bp_int3_addr = (u8 *)addr + sizeof(int3);
+	bp_int3_length = len - sizeof(int3);
 	bp_patching_in_progress = true;
 
 	lockdep_assert_held(&text_mutex);
@@ -913,7 +922,14 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	do {
+		kernel_insn_init(&insn, kaddr, MAX_INSN_SIZE);
+		insn_get_length(&insn);
+
+		text_poke(kaddr, &int3, sizeof(int3));
+
+		kaddr += insn.length;
+	} while (kaddr < addr + len);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 12:08           ` Peter Zijlstra
@ 2019-06-11 12:34             ` Peter Zijlstra
  2019-06-11 12:42               ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 12:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 02:08:34PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 10:03:07AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
> 
> > > I am surely missing some kprobe context, but is it really safe to use
> > > this mechanism to replace more than one instruction?
> > 
> > I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> > wrong.
> > 
> > So what happens is that arch_prepare_optimized_kprobe() <-
> > copy_optimized_instructions() copies however much of the instruction
> > stream is required such that we can overwrite the instruction at @addr
> > with a 5 byte jump.
> > 
> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> > 
> > And I'm thinking the problem is with something like:
> > 
> > @addr: nop nop nop nop nop
> > 
> > We copy out the nops into the trampoline, overwrite the first nop with
> > an INT3, overwrite the remaining nops with the rel addr, but oops,
> > another CPU can still be executing one of those NOPs, right?
> > 
> > I'm thinking we could fix this by first writing INT3 into all relevant
> > instructions, which is going to be messy, given the current code base.
> 
> Maybe not that bad; how's something like this?
> 
> (completely untested)
> 
> ---
>  arch/x86/kernel/alternative.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0d57015114e7..8f643dabea72 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -24,6 +24,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/io.h>
>  #include <asm/fixmap.h>
> +#include <asm/insn.h>
>  
>  int __read_mostly alternatives_patched;
>  
> @@ -849,6 +850,7 @@ static void do_sync_core(void *info)
>  
>  static bool bp_patching_in_progress;
>  static void *bp_int3_handler, *bp_int3_addr;
> +static unsigned int bp_int3_length;
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> @@ -867,7 +869,11 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> +	if (user_mode(regs))
> +		return 0;
> +
> +	if (regs->ip < (unsigned long)bp_int3_addr ||
> +	    regs->ip >= (unsigned long)bp_int3_addr + bp_int3_length)
>  		return 0;

Bugger, this isn't right. It'll jump to the beginning of the trampoline,
even if it is multiple instructions in, which would lead to executing
instructions twice, which would be BAD.

_maybe_, depending on what the slot looks like, we could do something
like:

	offset = regs->ip - (unsigned long)bp_int3_addr;
	regs->ip = bp_int3_handler + offset;

That is; jump into the slot at the same offset we hit the INT3, but this
is quickly getting yuck.

>  	/* set up the specified breakpoint handler */
> @@ -900,9 +906,12 @@ NOKPROBE_SYMBOL(poke_int3_handler);
>  void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  {
>  	unsigned char int3 = 0xcc;
> +	void *kaddr = addr;
> +	struct insn insn;
>  
>  	bp_int3_handler = handler;
>  	bp_int3_addr = (u8 *)addr + sizeof(int3);
> +	bp_int3_length = len - sizeof(int3);
>  	bp_patching_in_progress = true;
>  
>  	lockdep_assert_held(&text_mutex);
> @@ -913,7 +922,14 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 */
>  	smp_wmb();
>  
> -	text_poke(addr, &int3, sizeof(int3));
> +	do {
> +		kernel_insn_init(&insn, kaddr, MAX_INSN_SIZE);
> +		insn_get_length(&insn);
> +
> +		text_poke(kaddr, &int3, sizeof(int3));
> +
> +		kaddr += insn.length;
> +	} while (kaddr < addr + len);
>  
>  	on_each_cpu(do_sync_core, NULL, 1);
>  

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 12:34             ` Peter Zijlstra
@ 2019-06-11 12:42               ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 12:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 02:34:02PM +0200, Peter Zijlstra wrote:

> Bugger, this isn't right. It'll jump to the beginning of the trampoline,
> even if it is multiple instructions in, which would lead to executing
> instructions twice, which would be BAD.
> 
> _maybe_, depending on what the slot looks like, we could do something
> like:
> 
> 	offset = regs->ip - (unsigned long)bp_int3_addr;
> 	regs->ip = bp_int3_handler + offset;
> 
> That is; jump into the slot at the same offset we hit the INT3, but this
> is quickly getting yuck.

Yeah, that won't work either... it needs something far more complex :/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 14/15] static_call: Simple self-test module
  2019-06-11  8:29     ` Peter Zijlstra
@ 2019-06-11 13:02       ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-06-11 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 10:29:31AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 10, 2019 at 12:24:28PM -0500, Josh Poimboeuf wrote:
> > On Wed, Jun 05, 2019 at 03:08:07PM +0200, Peter Zijlstra wrote:
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  lib/Kconfig.debug      |    8 ++++++++
> > >  lib/Makefile           |    1 +
> > >  lib/test_static_call.c |   41 +++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 50 insertions(+)
> > > 
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -1955,6 +1955,14 @@ config TEST_STATIC_KEYS
> > >  
> > >  	  If unsure, say N.
> > >  
> > > +config TEST_STATIC_CALL
> > > +	tristate "Test static call"
> > > +	depends on m
> > > +	help
> > > +	  Test the static call interfaces.
> > > +
> > > +	  If unsure, say N.
> > > +
> > 
> > Any reason why we wouldn't just make this a built-in boot time test?
> 
> None whatsoever; but I did copy-paste from the static_key stuff and
> that has it for some reason.

Their functionality is pretty crucial, and I doubt anybody is manually
building and loading these tests.  Built-in tests seem wiser for both
static calls and static keys.
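
E.g. (a sketch) drop the "depends on m" and run the checks from an initcall,
so they happen on every boot when the option is enabled:

	static int __init test_static_call_init(void)
	{
		/* exercise static_call() / static_call_update() here */
		return 0;
	}
	late_initcall(test_static_call_init);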

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
                     ` (2 preceding siblings ...)
  2019-06-10 16:57   ` Josh Poimboeuf
@ 2019-06-11 15:14   ` Steven Rostedt
  2019-06-11 15:52     ` Peter Zijlstra
  3 siblings, 1 reply; 87+ messages in thread
From: Steven Rostedt @ 2019-06-11 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Wed, 05 Jun 2019 15:08:01 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
>  {
>  	unsigned char int3 = 0xcc;
>  
> -	bp_int3_handler = handler;
> +	bp_int3_opcode = emulate ?: opcode;
>  	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	bp_patching_in_progress = true;
>  
>  	lockdep_assert_held(&text_mutex);
>  
>  	/*
> +	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> +	 * notably a JMP, CALL or NOP5_ATOMIC.
> +	 */
> +	BUG_ON(len != 5);

If we have a BUG_ON() here, why bother with passing in len at all? Just
force it to be 5.

We could make it a WARN_ON() and return without doing anything.

This also prevents us from ever changing two-byte jmps.
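
Something like this, for instance (illustrative only):

	/*
	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
	 * notably a JMP, CALL or NOP5_ATOMIC.
	 */
	if (WARN_ON_ONCE(len != 5))
		return;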

-- Steve

> +
> +	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> -	 * in_progress and handler are correctly ordered wrt. patching.
> +	 * in_progress and opcode are correctly ordered wrt. patching.
>  	 */
>  	smp_wmb();
>  
> -

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11  8:03         ` Peter Zijlstra
  2019-06-11 12:08           ` Peter Zijlstra
@ 2019-06-11 15:22           ` Steven Rostedt
  2019-06-11 15:52             ` Steven Rostedt
  2019-06-11 15:55             ` Peter Zijlstra
  2019-06-11 15:54           ` Andy Lutomirski
  2 siblings, 2 replies; 87+ messages in thread
From: Steven Rostedt @ 2019-06-11 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Masami Hiramatsu, x86, linux-kernel,
	Ard Biesheuvel, Andy Lutomirski, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, 11 Jun 2019 10:03:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
> 
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
> 
> And I'm thinking the problem is with something like:
> 
> @addr: nop nop nop nop nop

What would work would be to:

	add breakpoint to first opcode.

	call synchronize_tasks();

	/* All tasks now hitting breakpoint and jumping over affected
	code */

	update the rest of the instructions.

	replace breakpoint with jmp.

One caveat is that the replaced instructions must not contain a function
call: if the called function calls schedule then it will circumvent the
synchronize_tasks(). It would be OK if that call is the last of the
instructions. But I doubt we modify anything larger than a call anyway,
so this should still work for all current instances.
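
Very roughly, that sequence would look like this (sketch only:
patch_nops_wide() is a made-up name, the real API is spelled
synchronize_rcu_tasks(), the sync_core() IPIs between steps are left
out, and it assumes poke_int3_handler() is set up to emulate a jump
over [addr, addr+len) while the breakpoint is installed):

	static void patch_nops_wide(u8 *addr, const u8 *insn, size_t len)
	{
		u8 int3 = 0xcc;

		lockdep_assert_held(&text_mutex);

		/* 1) breakpoint on the first opcode byte */
		text_poke(addr, &int3, 1);

		/* 2) wait for every task to have left [addr, addr+len) */
		synchronize_rcu_tasks();

		/* 3) update everything behind the breakpoint */
		text_poke(addr + 1, insn + 1, len - 1);

		/* 4) finally replace the breakpoint with the real first byte */
		text_poke(addr, insn, 1);
	}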

-- Steve

> 
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
> 
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:22           ` Steven Rostedt
@ 2019-06-11 15:52             ` Steven Rostedt
  2019-06-11 15:55             ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Steven Rostedt @ 2019-06-11 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Masami Hiramatsu, x86, linux-kernel,
	Ard Biesheuvel, Andy Lutomirski, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, 11 Jun 2019 11:22:54 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> What would work would be to:
> 
> 	add breakpoint to first opcode.
> 
> 	call synchronize_tasks();

BTW, that should be "synchronize_rcu_tasks()"

-- Steve

> 
> 	/* All tasks now hitting breakpoint and jumping over affected
> 	code */
> 
> 	update the rest of the instructions.
> 
> 	replace breakpoint with jmp.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:14   ` Steven Rostedt
@ 2019-06-11 15:52     ` Peter Zijlstra
  2019-06-11 16:21       ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 15:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 11:14:10AM -0400, Steven Rostedt wrote:
> On Wed, 05 Jun 2019 15:08:01 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> >  {
> >  	unsigned char int3 = 0xcc;
> >  
> > -	bp_int3_handler = handler;
> > +	bp_int3_opcode = emulate ?: opcode;
> >  	bp_int3_addr = (u8 *)addr + sizeof(int3);
> >  	bp_patching_in_progress = true;
> >  
> >  	lockdep_assert_held(&text_mutex);
> >  
> >  	/*
> > +	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> > +	 * notably a JMP, CALL or NOP5_ATOMIC.
> > +	 */
> > +	BUG_ON(len != 5);
> 
> If we have a bug on here, why bother with passing in len at all? Just
> force it to be 5.

Masami said the same.

> We could make it a WARN_ON() and return without doing anything.
> 
> This also prevents us from ever changing two byte jmps.

It doesn't; that is, we'd need to add emulation for the 3 byte jump, but
that'd be pretty trivial.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11  8:03         ` Peter Zijlstra
  2019-06-11 12:08           ` Peter Zijlstra
  2019-06-11 15:22           ` Steven Rostedt
@ 2019-06-11 15:54           ` Andy Lutomirski
  2019-06-11 16:11             ` Steven Rostedt
  2019-06-17 14:31             ` Peter Zijlstra
  2 siblings, 2 replies; 87+ messages in thread
From: Andy Lutomirski @ 2019-06-11 15:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira



> On Jun 11, 2019, at 1:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
>> 
>> 
>>> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>>> 
>>>>> This fits almost all text_poke_bp() users, except
>>>>> arch_unoptimize_kprobe() which restores random text, and for that site
>>>>> we have to build an explicit emulate instruction.
>>>> 
> >>>> Hm, actually it doesn't restore random text, since the first byte
> >>>> must always be int3. As the function name implies, it just unoptimizes
>>>> (jump based optprobe -> int3 based kprobe).
>>>> Anyway, that is not an issue. With this patch, optprobe must still work.
>>> 
>>> I thought it basically restored 5 bytes of original text (with no
>>> guarantee it is a single instruction, or even a complete instruction),
>>> with the first byte replaced with INT3.
>>> 
>> 
>> I am surely missing some kprobe context, but is it really safe to use
>> this mechanism to replace more than one instruction?
> 
> I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> wrong.
> 
> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
> 
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
> 
> And I'm thinking the problem is with something like:
> 
> @addr: nop nop nop nop nop
> 
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
> 
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.

How does that help?  If RIP == x+2 and you want to put a 5-byte jump at address x, no amount of 0xcc is going to change the fact that RIP is in the middle of the jump.

Live patching can handle this by detecting this condition on each CPU, but performance won’t be great.  Maybe some synchronize_sched trickery could help.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:22           ` Steven Rostedt
  2019-06-11 15:52             ` Steven Rostedt
@ 2019-06-11 15:55             ` Peter Zijlstra
  2019-06-12 19:44               ` Nadav Amit
  1 sibling, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 15:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Masami Hiramatsu, x86, linux-kernel,
	Ard Biesheuvel, Andy Lutomirski, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 11:22:54AM -0400, Steven Rostedt wrote:
> On Tue, 11 Jun 2019 10:03:07 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > So what happens is that arch_prepare_optimized_kprobe() <-
> > copy_optimized_instructions() copies however much of the instruction
> > stream is required such that we can overwrite the instruction at @addr
> > with a 5 byte jump.
> > 
> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> > 
> > And I'm thinking the problem is with something like:
> > 
> > @addr: nop nop nop nop nop
> 
> What would work would be to:
> 
> 	add breakpoint to first opcode.
> 
> 	call synchronize_tasks();
> 
> 	/* All tasks now hitting breakpoint and jumping over affected
> 	code */
> 
> 	update the rest of the instructions.
> 
> 	replace breakpoint with jmp.
> 
> One caveat is that the replaced instructions must not contain a function
> call: if the called function calls schedule then it will circumvent the
> synchronize_tasks(). It would be OK if that call is the last of the
> instructions. But I doubt we modify anything larger than a call anyway,
> so this should still work for all current instances.

Right, something like this could work (although I cannot currently find
synchronize_tasks), but it would make the optprobe stuff fairly slow
(iirc this sync_tasks() thing could be pretty horrible).



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:54           ` Andy Lutomirski
@ 2019-06-11 16:11             ` Steven Rostedt
  2019-06-17 14:31             ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Steven Rostedt @ 2019-06-11 16:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Masami Hiramatsu, x86, linux-kernel,
	Ard Biesheuvel, Andy Lutomirski, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, 11 Jun 2019 08:54:23 -0700
Andy Lutomirski <luto@amacapital.net> wrote:


> How does that help?  If RIP == x+2 and you want to put a 5-byte jump
> at address x, no amount of 0xcc is going to change the fact that RIP
> is in the middle of the jump.
> 
> Live patching can handle this by detecting this condition on each
> CPU, but performance won’t be great.  Maybe some synchronize_sched
> trickery could help.

We have synchronize_rcu_tasks() which return after all tasks have
either entered user space or did a voluntary schedule (was not
preempted). Or have not run (still in a sleeping state).

That way we guarantee that all tasks are no longer on any trampoline
or code paths that do not call schedule. I use this to free dynamically
allocated trampolines used by ftrace. And kprobes uses this too for its
own trampolines.
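
In other words, the pattern is basically (sketch; the struct and the
free_trampoline() helper are made-up stand-ins for the actual
ftrace/kprobes code):

	static void release_trampoline(struct my_tramp *tramp)
	{
		/*
		 * After this, no task can still be running in (or be
		 * preempted inside) the trampoline, because the
		 * trampoline itself never calls schedule().
		 */
		synchronize_rcu_tasks();

		free_trampoline(tramp);
	}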

-- Steve

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:52     ` Peter Zijlstra
@ 2019-06-11 16:21       ` Peter Zijlstra
  2019-06-12 14:44         ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-11 16:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 05:52:48PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 11:14:10AM -0400, Steven Rostedt wrote:
> > On Wed, 05 Jun 2019 15:08:01 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > > +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> > >  {
> > >  	unsigned char int3 = 0xcc;
> > >  
> > > -	bp_int3_handler = handler;
> > > +	bp_int3_opcode = emulate ?: opcode;
> > >  	bp_int3_addr = (u8 *)addr + sizeof(int3);
> > >  	bp_patching_in_progress = true;
> > >  
> > >  	lockdep_assert_held(&text_mutex);
> > >  
> > >  	/*
> > > +	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> > > +	 * notably a JMP, CALL or NOP5_ATOMIC.
> > > +	 */
> > > +	BUG_ON(len != 5);
> > 
> > If we have a bug on here, why bother with passing in len at all? Just
> > force it to be 5.
> 
> Masami said the same.
> 
> > We could make it a WARN_ON() and return without doing anything.
> > 
> > This also prevents us from ever changing two byte jmps.
> 
> It doesn't; that is, we'd need to add emulation for the 3 byte jump, but
> that'd be pretty trivial.

I can't find a 3 byte jump on x86_64, I could only find a 2 byte one.
But something like so should work I suppose, although at this point I'm
thinking we should just use the instruction decoder we have instead of
playing iffy games with packed structures.

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e1a4bb42eb92..abb9615dcb1d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -57,6 +57,9 @@ static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
 #define JMP_INSN_SIZE		5
 #define JMP_INSN_OPCODE		0xE9
 
+#define JMP8_INSN_SIZE		2
+#define JMP8_INSN_OPCODE	0xEB
+
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
 	/*
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5d0123a8183b..5df6c74a0b08 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -924,13 +924,18 @@ static void do_sync_core(void *info)
 static bool bp_patching_in_progress;
 static const void *bp_int3_opcode, *bp_int3_addr;
 
+struct poke_insn {
+	u8 opcode;
+	union {
+		s8 rel8;
+		s32 rel32;
+	};
+} __packed;
+
 int poke_int3_handler(struct pt_regs *regs)
 {
 	long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
-	struct opcode {
-		u8 insn;
-		s32 rel;
-	} __packed opcode;
+	struct poke_insn insn;
 
 	/*
 	 * Having observed our INT3 instruction, we now must observe
@@ -950,15 +955,19 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
 		return 0;
 
-	opcode = *(struct opcode *)bp_int3_opcode;
+	insn = *(struct poke_insn *)bp_int3_opcode;
 
-	switch (opcode.insn) {
+	switch (insn.opcode) {
 	case CALL_INSN_OPCODE:
-		int3_emulate_call(regs, ip + opcode.rel);
+		int3_emulate_call(regs, ip + insn.rel32);
 		break;
 
 	case JMP_INSN_OPCODE:
-		int3_emulate_jmp(regs, ip + opcode.rel);
+		int3_emulate_jmp(regs, ip + insn.rel32);
+		break;
+
+	case JMP8_INSN_OPCODE:
+		int3_emulate_jmp(regs, ip + insn.rel8);
 		break;
 
 	default: /* assume NOP */
@@ -992,7 +1001,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  */
 void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	unsigned char int3 = 0xcc;
+	unsigned char int3 = INT3_INSN_OPCODE;
+	unsigned char insn_byte;
 
 	bp_int3_opcode = emulate ?: opcode;
 	bp_int3_addr = (u8 *)addr + sizeof(int3);
@@ -1001,10 +1011,26 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulat
 	lockdep_assert_held(&text_mutex);
 
 	/*
-	 * poke_int3_handler() relies on @opcode being a 5 byte instruction;
-	 * notably a JMP, CALL or NOP5_ATOMIC.
+	 * Verify we support the actual instruction in poke_int3_handler().
 	 */
-	BUG_ON(len != 5);
+	insn_byte = *(unsigned char *)bp_int3_opcode;
+	switch (insn_byte) {
+	case CALL_INSN_OPCODE:
+		BUG_ON(len != CALL_INSN_SIZE);
+		break;
+
+	case JMP_INSN_OPCODE:
+		BUG_ON(len != JMP_INSN_SIZE);
+		break;
+
+	case JMP8_INSN_OPCODE:
+		BUG_ON(len != JMP8_INSN_SIZE);
+		break;
+
+	default: /* assume NOP5_ATOMIC */
+		BUG_ON(len != 5);
+		break;
+	}
 
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 16:21       ` Peter Zijlstra
@ 2019-06-12 14:44         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-12 14:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Masami Hiramatsu, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Nadav Amit, Rasmus Villemoes,
	Edward Cree, Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 06:21:28PM +0200, Peter Zijlstra wrote:
> although at this point I'm
> thinking we should just used the instruction decode we have instead of
> playing iffy games with packed structures.

How's something like this? It accepts jmp/32, jmp/8, call and nop5_atomic.

---
Subject: x86/alternatives: Teach text_poke_bp() to emulate instructions
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jun 5 10:48:37 CEST 2019

In preparation for static_call support, teach text_poke_bp() to
emulate instructions, including CALL.

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/text-patching.h |   15 +++++--
 arch/x86/kernel/alternative.c        |   73 +++++++++++++++++++++++++----------
 arch/x86/kernel/jump_label.c         |    3 -
 arch/x86/kernel/kprobes/opt.c        |   11 +++--
 4 files changed, 75 insertions(+), 27 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
@@ -48,8 +48,17 @@ static inline void int3_emulate_jmp(stru
 	regs->ip = ip;
 }
 
-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE		1
+#define INT3_INSN_OPCODE	0xCC
+
+#define CALL_INSN_SIZE		5
+#define CALL_INSN_OPCODE	0xE8
+
+#define JMP_INSN_SIZE		5
+#define JMP_INSN_OPCODE		0xE9
+
+#define JMP8_INSN_SIZE		2
+#define JMP8_INSN_OPCODE	0xEB
 
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -920,31 +920,45 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
-static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static const void *bp_int3_addr;
+static const struct insn *bp_int3_insn;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	long ip;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
-	 * bp_patching_in_progress.
-	 *
-	 * 	in_progress = TRUE		INT3
-	 * 	WMB				RMB
-	 * 	write INT3			if (in_progress)
+	 * bp_int3_addr and bp_int3_insn:
 	 *
-	 * Idem for bp_int3_handler.
+	 *	bp_int3_{addr,insn} = ..	INT3
+	 *	WMB				RMB
+	 *	write INT3			if (insn)
 	 */
 	smp_rmb();
 
-	if (likely(!bp_patching_in_progress))
+	if (likely(!bp_int3_insn))
 		return 0;
 
 	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	ip = regs->ip - INT3_INSN_SIZE + bp_int3_insn->length;
+
+	switch (bp_int3_insn->opcode.bytes[0]) {
+	case CALL_INSN_OPCODE:
+		int3_emulate_call(regs, ip + bp_int3_insn->immediate.value);
+		break;
+
+	case JMP_INSN_OPCODE:
+	case JMP8_INSN_OPCODE:
+		int3_emulate_jmp(regs, ip + bp_int3_insn->immediate.value);
+		break;
+
+	default: /* assume NOP */
+		int3_emulate_jmp(regs, ip);
+		break;
+	}
 
 	return 1;
 }
@@ -955,7 +969,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  * @addr:	address to patch
  * @opcode:	opcode of new instruction
  * @len:	length to copy
- * @handler:	address to jump to when the temporary breakpoint is hit
+ * @emulate:	opcode to emulate, when NULL use @opcode
  *
  * Modify multi-byte instruction by using int3 breakpoint on SMP.
  * We completely avoid stop_machine() here, and achieve the
@@ -970,19 +984,40 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  *	  replacing opcode
  *	- sync cores
  */
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	unsigned char int3 = 0xcc;
+	unsigned char int3 = INT3_INSN_OPCODE;
+	struct insn insn;
 
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
-	bp_patching_in_progress = true;
+	bp_int3_addr = addr + INT3_INSN_SIZE;
 
 	lockdep_assert_held(&text_mutex);
 
+	if (!emulate)
+		emulate = opcode;
+
+	kernel_insn_init(&insn, emulate, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+
+	BUG_ON(!insn_complete(&insn));
+	BUG_ON(insn.length != len);
+
+	switch (insn.opcode.bytes[0]) {
+	case CALL_INSN_OPCODE:
+	case JMP_INSN_OPCODE:
+	case JMP8_INSN_OPCODE:
+		break;
+
+	default:
+		BUG_ON(len != 5);
+		BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], 5));
+	}
+
+	bp_int3_insn = &insn;
+
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
-	 * in_progress and handler are correctly ordered wrt. patching.
+	 * bp_int3_addr and bp_int3_insn are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
@@ -1011,6 +1046,6 @@ void text_poke_bp(void *addr, const void
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching_in_progress = false;
+	bp_int3_insn = NULL;
 }
 
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -87,8 +87,7 @@ static void __ref __jump_label_transform
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
-		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
 }
 
 void arch_jump_label_transform(struct jump_entry *entry,
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
 		insn_buff[0] = RELATIVEJUMP_OPCODE;
 		*(s32 *)(&insn_buff[1]) = rel;
 
-		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-			     op->optinsn.insn);
+		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);
 
 		list_del_init(&op->list);
 	}
@@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
 void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	u8 insn_buff[RELATIVEJUMP_SIZE];
+	u8 emulate_buff[RELATIVEJUMP_SIZE];
 
 	/* Set int3 to first byte for kprobes */
 	insn_buff[0] = BREAKPOINT_INSTRUCTION;
 	memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+	emulate_buff[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
+			((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
 	text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-		     op->optinsn.insn);
+		     emulate_buff);
 }
 
 /*

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-07 17:34     ` Peter Zijlstra
  2019-06-07 17:48       ` Linus Torvalds
  2019-06-07 18:10       ` Andy Lutomirski
@ 2019-06-12 17:09       ` Peter Zijlstra
  2 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-12 17:09 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jason Baron, Jiri Kosina, David Laight, Borislav Petkov,
	Julia Cartwright, Jessica Yu, H. Peter Anvin, Nadav Amit,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira

On Fri, Jun 07, 2019 at 07:34:27PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:

> > > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > >  	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > >  		return 0;
> > >  
> > > -	/* set up the specified breakpoint handler */
> > > -	regs->ip = (unsigned long) bp_int3_handler;
> > > +	opcode = *(struct opcode *)bp_int3_opcode;
> > > +
> > > +	switch (opcode.insn) {
> > > +	case 0xE8: /* CALL */
> > > +		int3_emulate_call(regs, ip + opcode.rel);
> > > +		break;
> > > +
> > > +	case 0xE9: /* JMP */
> > > +		int3_emulate_jmp(regs, ip + opcode.rel);
> > > +		break;
> > > +
> > > +	default: /* assume NOP */
> > 
> > Shouldn't we check whether it is actually NOP here?
> 
> I was/am lazy and didn't want to deal with:
> 
> arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5
> 
> But maybe we should check for all the various NOP5 variants and BUG() on
> anything unexpected.

I realized we never actually poke a !ideal nop5_atomic, so I've added
that to the latest versions.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:55             ` Peter Zijlstra
@ 2019-06-12 19:44               ` Nadav Amit
  2019-06-17 14:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 87+ messages in thread
From: Nadav Amit @ 2019-06-12 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Andy Lutomirski, Masami Hiramatsu,
	the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Ingo Molnar, Thomas Gleixner, Linus Torvalds, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

> On Jun 11, 2019, at 8:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Jun 11, 2019 at 11:22:54AM -0400, Steven Rostedt wrote:
>> On Tue, 11 Jun 2019 10:03:07 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> 
>>> So what happens is that arch_prepare_optimized_kprobe() <-
>>> copy_optimized_instructions() copies however much of the instruction
>>> stream is required such that we can overwrite the instruction at @addr
>>> with a 5 byte jump.
>>> 
>>> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
>>> instruction @addr with int3, copies the rel jump address and overwrites
>>> the int3 with jmp.
>>> 
>>> And I'm thinking the problem is with something like:
>>> 
>>> @addr: nop nop nop nop nop
>> 
>> What would work would be to:
>> 
>> 	add breakpoint to first opcode.
>> 
>> 	call synchronize_tasks();
>> 
>> 	/* All tasks now hitting breakpoint and jumping over affected
>> 	code */
>> 
>> 	update the rest of the instructions.
>> 
>> 	replace breakpoint with jmp.
>> 
>> One caveat is that the replaced instructions must not contain a function
>> call: if the called function calls schedule then it will circumvent the
>> synchronize_tasks(). It would be OK if that call is the last of the
>> instructions. But I doubt we modify anything larger than a call anyway,
>> so this should still work for all current instances.
> 
> Right, something like this could work (although I cannot currently find
> synchronize_tasks), but it would make the optprobe stuff fairly slow
> (iirc this sync_tasks() thing could be pretty horrible).

I have run into similar problems before.

I had two problematic scenarios. In the first case, I had a “call” in the
middle of the patched code-block, but this call was always followed by a
“jump” to the end of the potentially patched code-block, so I did not have
the problem.

In the second case, I had an indirect call (which is shorter than a direct
call) being patched into a direct call. In this case, I preceded the
indirect call with NOPs so indeed the indirect call was at the end of the
patched block.

In certain cases, if a shorter instruction should be potentially patched
into a longer one, the shorter one can be preceded by some prefixes. If
there are multiple REX prefixes, for instance, the CPU only uses the last
one, IIRC. This can make it possible to avoid synchronize_sched() when
patching a single instruction into another instruction with a different
length.
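
For illustration only (byte encodings written from memory, so please
double-check them), the idea is that both variants then occupy the same
5-byte slot:

	/* 2-byte indirect call padded out to the 5 bytes of a direct call */
	static const u8 padded_indirect[5] = {
		0x3e, 0x3e, 0x3e,	/* redundant segment prefixes, ignored */
		0xff, 0xd0,		/* call *%rax */
	};
	static const u8 direct_call[5] = {
		0xe8, 0, 0, 0, 0,	/* call rel32, offset filled in when patching */
	};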

Not sure how helpful this information is, but sharing - just in case.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-11 15:54           ` Andy Lutomirski
  2019-06-11 16:11             ` Steven Rostedt
@ 2019-06-17 14:31             ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-17 14:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, x86, linux-kernel, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Tue, Jun 11, 2019 at 08:54:23AM -0700, Andy Lutomirski wrote:
> > On Jun 11, 2019, at 1:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> > 
> > And I'm thinking the problem is with something like:
> > 
> > @addr: nop nop nop nop nop
> > 
> > We copy out the nops into the trampoline, overwrite the first nop with
> > an INT3, overwrite the remaining nops with the rel addr, but oops,
> > another CPU can still be executing one of those NOPs, right?
> > 
> > I'm thinking we could fix this by first writing INT3 into all relevant
> > instructions, which is going to be messy, given the current code base.
> 
> How does that help?  If RIP == x+2 and you want to put a 5-byte jump
> at address x, no amount of 0xcc is going to change the fact that RIP
> is in the middle of the jump.

What I proposed was doing 0xCC on every instruction that's inside of
[x,x+5). After that we do machine wide IPI+SYNC.

So if RIP is at x+2, then it will observe 0xCC and trigger the exception
there.

Now, the problem is that my exception cannot recover from anything
except NOPs, so while it could work for some corner cases, it doesn't
work for the optkprobe case in general.

Only then do we write the JMP offset and again IPI+SYNC; then we
write the 0xE8 and again IPI+SYNC.

But at that point we already have the guarantee nobody is inside
[x,x+5). That is, except if we can get to x+2 without first going
through x, IOW if x+2 is a branch target we're screwed any which way
around and the poke is never valid.

Is that something optkprobes checks? If so, where and how?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-12 19:44               ` Nadav Amit
@ 2019-06-17 14:42                 ` Peter Zijlstra
  2019-06-17 17:06                   ` Nadav Amit
  2019-06-17 17:25                   ` Andy Lutomirski
  0 siblings, 2 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-17 14:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Steven Rostedt, Andy Lutomirski, Masami Hiramatsu,
	the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Ingo Molnar, Thomas Gleixner, Linus Torvalds, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:

> I have run into similar problems before.
> 
> I had two problematic scenarios. In the first case, I had a “call” in the
> middle of the patched code-block, but this call was always followed by a
> “jump” to the end of the potentially patched code-block, so I did not have
> the problem.
> 
> In the second case, I had an indirect call (which is shorter than a direct

Longer, 6 bytes vs 5 if I'm not mistaken.

> call) being patched into a direct call. In this case, I preceded the
> indirect call with NOPs so indeed the indirect call was at the end of the
> patched block.
> 
> In certain cases, if a shorter instruction should be potentially patched
> into a longer one, the shorter one can be preceded by some prefixes. If
> there are multiple REX prefixes, for instance, the CPU only uses the last
> one, IIRC. This can make it possible to avoid synchronize_sched() when
> patching a single instruction into another instruction with a different
> length.
> 
> Not sure how helpful this information is, but sharing - just in case.

I think we can patch multiple instructions provided:

 - all but one instruction are a NOP,
 - there are no branch targets inside the range.

By poking INT3 at every instruction in the range and then doing the
machine wide IPI+SYNC, we'll trap every CPU that is inside the range.

Because all but one instruction are NOPs, we can emulate only the one
instruction (assuming the real instruction is always last), and otherwise
emulate a NOP when we're behind the real instruction.

Then we can write new instructions, leaving the initial INT3 until last.

Something like this might be useful if we want to support immediate
instructions (like patch_data_* in paravirt_patch.c) for static_call().



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-17 14:42                 ` Peter Zijlstra
@ 2019-06-17 17:06                   ` Nadav Amit
  2019-06-17 17:25                   ` Andy Lutomirski
  1 sibling, 0 replies; 87+ messages in thread
From: Nadav Amit @ 2019-06-17 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Andy Lutomirski, Masami Hiramatsu,
	the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Ingo Molnar, Thomas Gleixner, Linus Torvalds, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

> On Jun 17, 2019, at 7:42 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
> 
>> I have run into similar problems before.
>> 
>> I had two problematic scenarios. In the first case, I had a “call” in the
>> middle of the patched code-block, but this call was always followed by a
>> “jump” to the end of the potentially patched code-block, so I did not have
>> the problem.
>> 
>> In the second case, I had an indirect call (which is shorter than a direct
> 
> Longer, 6 bytes vs 5 if I'm not mistaken.

Shorter (2-3 bytes IIRC), since the target was held in a register.

> 
>> call) being patched into a direct call. In this case, I preceded the
>> indirect call with NOPs so indeed the indirect call was at the end of the
>> patched block.
>> 
>> In certain cases, if a shorter instruction should be potentially patched
>> into a longer one, the shorter one can be preceded by some prefixes. If
>> there are multiple REX prefixes, for instance, the CPU only uses the last
>> one, IIRC. This can make it possible to avoid synchronize_sched() when
>> patching a single instruction into another instruction with a different
>> length.
>> 
>> Not sure how helpful this information is, but sharing - just in case.
> 
> I think we can patch multiple instructions provided:
> 
> - all but one instruction are a NOP,
> - there are no branch targets inside the range.
> 
> By poking INT3 at every instruction in the range and then doing the
> machine wide IPI+SYNC, we'll trap every CPU that is inside the range.
> 
> Because all but one instruction are NOPs, we can emulate only the one
> instruction (assuming the real instruction is always last), and otherwise
> emulate a NOP when we're behind the real instruction.
> 
> Then we can write new instructions, leaving the initial INT3 until last.
> 
> Something like this might be useful if we want to support immediate
> instructions (like patch_data_* in paravirt_patch.c) for static_call().

I don't know what you mean by SYNC, but if you mean something like
sync_core() (in contrast to something like synchronize_sched()), I am
not sure it is sufficient.

Using IPI+sync_core(), I think, would make the assumption that IRQs are
never enabled inside IRQ and exception handlers, or that these handlers
would not be invoked while the patched code is executed. Otherwise, the
IPI might be received inside the IRQ/exception handler, and the return
from the handler will land in the middle of a patched instruction.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-17 14:42                 ` Peter Zijlstra
  2019-06-17 17:06                   ` Nadav Amit
@ 2019-06-17 17:25                   ` Andy Lutomirski
  2019-06-17 19:26                     ` Peter Zijlstra
  1 sibling, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2019-06-17 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Steven Rostedt, Masami Hiramatsu,
	the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Ingo Molnar, Thomas Gleixner, Linus Torvalds, Jason Baron,
	Jiri Kosina, David Laight, Borislav Petkov, Julia Cartwright,
	Jessica Yu, H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 17, 2019 at 7:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
>
> > I have run into similar problems before.
> >
> > I had two problematic scenarios. In the first case, I had a “call” in the
> > middle of the patched code-block, but this call was always followed by a
> > “jump” to the end of the potentially patched code-block, so I did not have
> > the problem.
> >
> > In the second case, I had an indirect call (which is shorter than a direct
>
> Longer, 6 bytes vs 5 if I'm not mistaken.
>
> > call) being patched into a direct call. In this case, I preceded the
> > indirect call with NOPs so indeed the indirect call was at the end of the
> > patched block.
> >
> > In certain cases, if a shorter instruction should be potentially patched
> > into a longer one, the shorter one can be preceded by some prefixes. If
> > there are multiple REX prefixes, for instance, the CPU only uses the last
> > one, IIRC. This can make it possible to avoid synchronize_sched() when
> > patching a single instruction into another instruction with a different
> > length.
> >
> > Not sure how helpful this information is, but sharing - just in case.
>
> I think we can patch multiple instructions provided:
>
>  - all but one instruction are a NOP,
>  - there are no branch targets inside the range.
>
> By poking INT3 at every instruction in the range and then doing the
> machine wide IPI+SYNC, we'll trap every CPU that is inside the range.

How do you know you'll trap them?  You need to IPI, serialize, and get
them to execute an instruction.  If the CPU is in an interrupt and RIP
just happens to be pointed to the INT3, you need them to execute a
whole lot more than just one instruction.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
  2019-06-17 17:25                   ` Andy Lutomirski
@ 2019-06-17 19:26                     ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-06-17 19:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Steven Rostedt, Masami Hiramatsu,
	the arch/x86 maintainers, LKML, Ard Biesheuvel, Ingo Molnar,
	Thomas Gleixner, Linus Torvalds, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 17, 2019 at 10:25:27AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 7:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
> >
> > > I have run into similar problems before.
> > >
> > > I had two problematic scenarios. In the first case, I had a “call” in the
> > > middle of the patched code-block, but this call was always followed by a
> > > “jump” to the end of the potentially patched code-block, so I did not have
> > > the problem.
> > >
> > > In the second case, I had an indirect call (which is shorter than a direct
> >
> > Longer, 6 bytes vs 5 if I'm not mistaken.
> >
> > > call) being patched into a direct call. In this case, I preceded the
> > > indirect call with NOPs so indeed the indirect call was at the end of the
> > > patched block.
> > >
> > > In certain cases, if a shorter instruction should be potentially patched
> > > into a longer one, the shorter one can be preceded by some prefixes. If
> > > there are multiple REX prefixes, for instance, the CPU only uses the last
> > > one, IIRC. This can make it possible to avoid synchronize_sched() when
> > > patching a single instruction into another instruction with a different
> > > length.
> > >
> > > Not sure how helpful this information is, but sharing - just in case.
> >
> > I think we can patch multiple instructions provided:
> >
> >  - all but one instruction are a NOP,
> >  - there are no branch targets inside the range.
> >
> > By poking INT3 at every instruction in the range and then doing the
> > machine wide IPI+SYNC, we'll trap every CPU that is inside the range.
> 
> How do you know you'll trap them?  You need to IPI, serialize, and get
> them to execute an instruction.  If the CPU is in an interrupt and RIP
> just happens to be pointed to the INT3, you need them to execute a
> whole lot more than just one instruction.

Argh, yes, I'm an idiot.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/15] static_call: Add inline static call infrastructure
  2019-06-10 17:19       ` Josh Poimboeuf
  2019-06-10 18:33         ` Nadav Amit
@ 2019-10-01 12:00         ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-10-01 12:00 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 12:19:29PM -0500, Josh Poimboeuf wrote:
> On Fri, Jun 07, 2019 at 10:37:56AM +0200, Peter Zijlstra wrote:
> > > > +}
> > > > +
> > > > +static int static_call_module_notify(struct notifier_block *nb,
> > > > +				     unsigned long val, void *data)
> > > > +{
> > > > +	struct module *mod = data;
> > > > +	int ret = 0;
> > > > +
> > > > +	cpus_read_lock();
> > > > +	static_call_lock();
> > > > +
> > > > +	switch (val) {
> > > > +	case MODULE_STATE_COMING:
> > > > +		module_disable_ro(mod);
> > > > +		ret = static_call_add_module(mod);
> > > > +		module_enable_ro(mod, false);
> > > 
> > > Doesn’t it cause some pages to be W+X ?
> 
> How so?

This is after complete_formation() which does RO,X. If we then disable
RO we end up with W+X pages, which is bad.

That said, alternatives, ftrace, dynamic_debug all run before
complete_formation() specifically such that they can directly poke text.

Possibly we should add a notifier callback for MODULE_STATE_UNFORMED,
but that is for another day.

> >> Can it be avoided?
> > 
> > I don't know why it does this, jump_labels doesn't seem to need this,
> > and I'm not seeing what static_call needs differently.
> 
> I forgot why I did this, but it's probably for the case where there's a
> static call site in module init code.  It deserves a comment.
> 
> Theoretically, jump labels need this too.
> 
> BTW, there's a change coming that will require the text_mutex before
> calling module_{disable,enable}_ro().

I can't find why it would need this (and I'm going to remove it).
Specifically complete_formation() does enable_ro(.after_init=false),
which leaves .ro_after_init writable so
{jump_label,static_call}_sort_entries() will work.

But both jump_label and static_call then use the full text_poke(), not
text_poke_early(), for modules.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64
  2019-06-10 18:33   ` Josh Poimboeuf
  2019-06-10 18:45     ` Nadav Amit
@ 2019-10-01 14:43     ` Peter Zijlstra
  1 sibling, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2019-10-01 14:43 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Nadav Amit, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Mon, Jun 10, 2019 at 01:33:57PM -0500, Josh Poimboeuf wrote:
> On Wed, Jun 05, 2019 at 03:08:06PM +0200, Peter Zijlstra wrote:
> > --- a/arch/x86/include/asm/static_call.h
> > +++ b/arch/x86/include/asm/static_call.h
> > @@ -2,6 +2,20 @@
> >  #ifndef _ASM_STATIC_CALL_H
> >  #define _ASM_STATIC_CALL_H
> >  
> > +#include <asm/asm-offsets.h>
> > +
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +
> > +/*
> > + * This trampoline is only used during boot / module init, so it's safe to use
> > + * the indirect branch without a retpoline.
> > + */
> > +#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func)				\
> > +	ANNOTATE_RETPOLINE_SAFE						\
> > +	"jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
> > +
> > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> 
> I wonder if we can simplify this (and drop the indirect branch) by
> getting rid of the above cruft, and instead just use the out-of-line
> trampoline as the default for inline as well.
> 
> Then the inline case could fall back to the out-of-line implementation
> (by patching the trampoline's jmp dest) before static_call_initialized
> is set.

I think I've got that covered. I changed arch_static_call_transform() to
(always) first rewrite the trampoline and then the in-line site.

That way, when early/module crud comes in that still points at the
trampoline, it will jump to the right place.

I've not even compiled yet, but it ought to work ;-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-06-07  8:28     ` Peter Zijlstra
  2019-06-07  8:49       ` Ard Biesheuvel
@ 2019-10-02 13:54       ` Peter Zijlstra
  2019-10-02 20:48         ` Josh Poimboeuf
  1 sibling, 1 reply; 87+ messages in thread
From: Peter Zijlstra @ 2019-10-02 13:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, LKML, Ard Biesheuvel, Andy Lutomirski,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Masami Hiramatsu, Jason Baron, Jiri Kosina, David Laight,
	Borislav Petkov, Julia Cartwright, Jessica Yu, H. Peter Anvin,
	Rasmus Villemoes, Edward Cree, Daniel Bristot de Oliveira,
	Josh Poimboeuf

On Fri, Jun 07, 2019 at 10:28:51AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
> > > + * Usage example:
> > > + *
> > > + *   # Start with the following functions (with identical prototypes):
> > > + *   int func_a(int arg1, int arg2);
> > > + *   int func_b(int arg1, int arg2);
> > > + *
> > > + *   # Define a 'my_key' reference, associated with func_a() by default
> > > + *   DEFINE_STATIC_CALL(my_key, func_a);
> > > + *
> > > + *   # Call func_a()
> > > + *   static_call(my_key, arg1, arg2);
> > > + *
> > > + *   # Update 'my_key' to point to func_b()
> > > + *   static_call_update(my_key, func_b);
> > > + *
> > > + *   # Call func_b()
> > > + *   static_call(my_key, arg1, arg2);
> > 
> > I think that this calling interface is not very intuitive.
> 
> Yeah, it is somewhat unfortunate..
> 
> > I understand that
> > the macros/objtool cannot allow the calling interface to be completely
> > transparent (as compiler plugin could). But, can the macros be used to
> > paste the key with the “static_call”? I think that having something like:
> > 
> >   static_call__func(arg1, arg2)
> > 
> > Is more readable than
> > 
> >   static_call(func, arg1, arg2)
> 
> Doesn't really make it much better for me; I think I'd prefer to switch
> to the GCC plugin scheme over this.  ISTR there being some prototypes
> there, but I couldn't quickly locate them.

How about something like:

	static_call(key)(arg1, arg2);

which is very close to the regular indirect call syntax. Furthermore,
how about we put the trampolines in .static_call.text instead of relying
on prefixes?
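
Roughly, the macro could expand to something like (sketch only,
assuming the STATIC_CALL_TRAMP() naming from this series):

	#define static_call(key)	(STATIC_CALL_TRAMP(key))

so that static_call(key)(arg1, arg2) compiles to a plain direct call to
the trampoline, which the inline implementation can then find and patch
as before.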

Also, I think I can shrink static_call_key by half:

 - we can do away with static_call_key::tramp; there are only two usage
   sites:

     o __static_call_update, the static_call() macro can provide the
       address of STATIC_CALL_TRAMP(key) directly

     o static_call_add_module(), has two cases:

       * the trampoline is from outside the module; in this case
         it will already have been updated by a previous call to
	 __static_call_update.
       * the trampoline is from inside the module; in this case
         it will have the default value and it doesn't need an
	 update.

       so in no case does static_call_add_module() need to modify a
       trampoline.

  - we can change static_call_key::site_mods into a single next pointer,
    just like jump_label's static_key.
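
That would leave something like (a sketch; the field and type names are
illustrative, with 'struct static_call_mod' standing in for whatever
patch 11 uses as the per-module site record):

	struct static_call_key {
		void *func;
		struct static_call_mod *next;	/* single list, like static_key */
	};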

But so far all the schemes I've come up with require 'key' to be a name,
it cannot be an actual 'struct static_call_key *' value. And therefore
usage from within structures isn't allowed.
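
That limitation comes from the token pasting; the macros have to form a
symbol name from 'key', presumably along the lines of:

	/* sketch: 'key' must be a literal name, it gets pasted into a symbol */
	#define STATIC_CALL_TRAMP(key)	__static_call_tramp_##key

so an expression of type 'struct static_call_key *' simply can't be
used there.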



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 10/15] static_call: Add basic static call infrastructure
  2019-10-02 13:54       ` Peter Zijlstra
@ 2019-10-02 20:48         ` Josh Poimboeuf
  0 siblings, 0 replies; 87+ messages in thread
From: Josh Poimboeuf @ 2019-10-02 20:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, the arch/x86 maintainers, LKML, Ard Biesheuvel,
	Andy Lutomirski, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, Masami Hiramatsu, Jason Baron, Jiri Kosina,
	David Laight, Borislav Petkov, Julia Cartwright, Jessica Yu,
	H. Peter Anvin, Rasmus Villemoes, Edward Cree,
	Daniel Bristot de Oliveira

On Wed, Oct 02, 2019 at 03:54:17PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 10:28:51AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 06, 2019 at 10:44:23PM +0000, Nadav Amit wrote:
> > > > + * Usage example:
> > > > + *
> > > > + *   # Start with the following functions (with identical prototypes):
> > > > + *   int func_a(int arg1, int arg2);
> > > > + *   int func_b(int arg1, int arg2);
> > > > + *
> > > > + *   # Define a 'my_key' reference, associated with func_a() by default
> > > > + *   DEFINE_STATIC_CALL(my_key, func_a);
> > > > + *
> > > > + *   # Call func_a()
> > > > + *   static_call(my_key, arg1, arg2);
> > > > + *
> > > > + *   # Update 'my_key' to point to func_b()
> > > > + *   static_call_update(my_key, func_b);
> > > > + *
> > > > + *   # Call func_b()
> > > > + *   static_call(my_key, arg1, arg2);
> > > 
> > > I think that this calling interface is not very intuitive.
> > 
> > Yeah, it is somewhat unfortunate..
> > 
> > > I understand that
> > > the macros/objtool cannot allow the calling interface to be completely
> > > transparent (as compiler plugin could). But, can the macros be used to
> > > paste the key with the “static_call”? I think that having something like:
> > > 
> > >   static_call__func(arg1, arg2)
> > > 
> > > Is more readable than
> > > 
> > >   static_call(func, arg1, arg2)
> > 
> > Doesn't really make it much better for me; I think I'd prefer to switch
> > to the GCC plugin scheme over this.  ISTR there being some prototypes
> > there, but I couldn't quickly locate them.
> 
> How about something like:
> 
> 	static_call(key)(arg1, arg2);
> 
> which is very close to the regular indirect call syntax.

Looks ok to me.

> Furthermore, how about we put the trampolines in .static_call.text
> instead of relying on prefixes?

Yeah, that would probably be better.

> Also, I think I can shrink static_call_key by half:
> 
>  - we can do away with static_call_key::tramp; there are only two usage
>    sites:
> 
>      o __static_call_update, the static_call() macro can provide the
>        address of STATIC_CALL_TRAMP(key) directly
> 
>      o static_call_add_module(), has two cases:
> 
>        * the trampoline is from outside the module; in this case
>          it will already have been updated by a previous call to
> 	 __static_call_update.
>        * the trampoline is from inside the module; in this case
>          it will have the default value and it doesn't need an
> 	 update.
> 
>        so in no case does static_call_add_module() need to modify a
>        trampoline.

Sounds plausible.

>   - we can change static_call_key::site_mods into a single next pointer,
>     just like jump_label's static_key.

Yep.

> But so far all the schemes I've come up with require 'key' to be a name,
> it cannot be an actual 'struct static_call_key *' value. And therefore
> usage from within structures isn't allowed.

Is that something we need?  At least we were able to work around this
limitation with tracepoints' usage of static calls.  But I could see how
it could be useful.

One way to solve that would be a completely different implementation:
have a global trampoline which detects the call site of the caller,
associates it with the given key, schedules some work to patch the call
site later, and then jumps to key->func.  So the first call would
trigger the patching.
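
Very roughly, and purely as a thought experiment (all names are made
up; the return address would have to be handed over by a small asm
stub):

	void *static_call_slow_path(struct static_call_key *key,
				    unsigned long ret_addr)
	{
		/* the call site sits just before the return address */
		unsigned long site = ret_addr - CALL_INSN_SIZE;

		/* made-up helper: queue the site for later patching */
		defer_site_patch(key, site);

		/* the asm stub tail-jumps to whatever we return */
		return key->func;
	}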

Then we might not even need objtool :-)  But it might be tricky to pull
off.

-- 
Josh

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2019-10-02 20:48 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05 13:07 [PATCH 00/15] x86 cleanups and static_call() Peter Zijlstra
2019-06-05 13:07 ` [PATCH 01/15] x86/entry/32: Clean up return from interrupt preemption path Peter Zijlstra
2019-06-07 14:21   ` Josh Poimboeuf
2019-06-05 13:07 ` [PATCH 02/15] x86: Move ENCODE_FRAME_POINTER to asm/frame.h Peter Zijlstra
2019-06-07 14:24   ` Josh Poimboeuf
2019-06-05 13:07 ` [PATCH 03/15] x86/kprobes: Fix frame pointer annotations Peter Zijlstra
2019-06-07 13:02   ` Masami Hiramatsu
2019-06-07 13:36     ` Josh Poimboeuf
2019-06-07 15:21       ` Masami Hiramatsu
2019-06-11  8:12       ` Peter Zijlstra
2019-06-05 13:07 ` [PATCH 04/15] x86/ftrace: Add pt_regs frame annotations Peter Zijlstra
2019-06-07 14:45   ` Josh Poimboeuf
2019-06-05 13:07 ` [PATCH 05/15] x86_32: Provide consistent pt_regs Peter Zijlstra
2019-06-07 13:13   ` Masami Hiramatsu
2019-06-07 19:32   ` Josh Poimboeuf
2019-06-11  8:14     ` Peter Zijlstra
2019-06-05 13:07 ` [PATCH 06/15] x86_32: Allow int3_emulate_push() Peter Zijlstra
2019-06-05 13:08 ` [PATCH 07/15] x86: Add int3_emulate_call() selftest Peter Zijlstra
2019-06-10 16:52   ` Josh Poimboeuf
2019-06-10 16:57     ` Andy Lutomirski
2019-06-11  8:17       ` Peter Zijlstra
2019-06-05 13:08 ` [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions Peter Zijlstra
2019-06-07  5:41   ` Nadav Amit
2019-06-07  8:20     ` Peter Zijlstra
2019-06-07 14:27       ` Masami Hiramatsu
2019-06-07 15:47   ` Masami Hiramatsu
2019-06-07 17:34     ` Peter Zijlstra
2019-06-07 17:48       ` Linus Torvalds
2019-06-11 10:44         ` Peter Zijlstra
2019-06-07 18:10       ` Andy Lutomirski
2019-06-07 20:22         ` hpa
2019-06-11  8:03         ` Peter Zijlstra
2019-06-11 12:08           ` Peter Zijlstra
2019-06-11 12:34             ` Peter Zijlstra
2019-06-11 12:42               ` Peter Zijlstra
2019-06-11 15:22           ` Steven Rostedt
2019-06-11 15:52             ` Steven Rostedt
2019-06-11 15:55             ` Peter Zijlstra
2019-06-12 19:44               ` Nadav Amit
2019-06-17 14:42                 ` Peter Zijlstra
2019-06-17 17:06                   ` Nadav Amit
2019-06-17 17:25                   ` Andy Lutomirski
2019-06-17 19:26                     ` Peter Zijlstra
2019-06-11 15:54           ` Andy Lutomirski
2019-06-11 16:11             ` Steven Rostedt
2019-06-17 14:31             ` Peter Zijlstra
2019-06-12 17:09       ` Peter Zijlstra
2019-06-10 16:57   ` Josh Poimboeuf
2019-06-11 15:14   ` Steven Rostedt
2019-06-11 15:52     ` Peter Zijlstra
2019-06-11 16:21       ` Peter Zijlstra
2019-06-12 14:44         ` Peter Zijlstra
2019-06-05 13:08 ` [PATCH 09/15] compiler.h: Make __ADDRESSABLE() symbol truly unique Peter Zijlstra
2019-06-05 13:08 ` [PATCH 10/15] static_call: Add basic static call infrastructure Peter Zijlstra
2019-06-06 22:44   ` Nadav Amit
2019-06-07  8:28     ` Peter Zijlstra
2019-06-07  8:49       ` Ard Biesheuvel
2019-06-07 16:33         ` Andy Lutomirski
2019-06-07 16:58         ` Nadav Amit
2019-10-02 13:54       ` Peter Zijlstra
2019-10-02 20:48         ` Josh Poimboeuf
2019-06-05 13:08 ` [PATCH 11/15] static_call: Add inline " Peter Zijlstra
2019-06-06 22:24   ` Nadav Amit
2019-06-07  8:37     ` Peter Zijlstra
2019-06-07 16:35       ` Nadav Amit
2019-06-07 17:41         ` Peter Zijlstra
2019-06-10 17:19       ` Josh Poimboeuf
2019-06-10 18:33         ` Nadav Amit
2019-06-10 18:42           ` Josh Poimboeuf
2019-10-01 12:00         ` Peter Zijlstra
2019-06-05 13:08 ` [PATCH 12/15] x86/static_call: Add out-of-line static call implementation Peter Zijlstra
2019-06-07  6:13   ` Nadav Amit
2019-06-07  7:51     ` Steven Rostedt
2019-06-07  8:38     ` Peter Zijlstra
2019-06-07  8:52       ` Peter Zijlstra
2019-06-05 13:08 ` [PATCH 13/15] x86/static_call: Add inline static call implementation for x86-64 Peter Zijlstra
2019-06-07  5:50   ` Nadav Amit
2019-06-10 18:33   ` Josh Poimboeuf
2019-06-10 18:45     ` Nadav Amit
2019-06-10 18:55       ` Josh Poimboeuf
2019-06-10 19:20         ` Nadav Amit
2019-10-01 14:43     ` Peter Zijlstra
2019-06-05 13:08 ` [PATCH 14/15] static_call: Simple self-test module Peter Zijlstra
2019-06-10 17:24   ` Josh Poimboeuf
2019-06-11  8:29     ` Peter Zijlstra
2019-06-11 13:02       ` Josh Poimboeuf
2019-06-05 13:08 ` [PATCH 15/15] tracepoints: Use static_call Peter Zijlstra
