* [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes
@ 2020-05-05 13:41 Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
                   ` (17 more replies)
  0 siblings, 18 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Folks!

This is the second part of the rework series. Part 1 can be found here:

 https://lore.kernel.org/r/20200505131602.633487962@linutronix.de

The series has a total of 138 patches and is split into 5 parts. The base
for this series is:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v4-part-1

The full series with all parts applied is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v4-part-5

The second part, i.e. this series, is available from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v4-part-2
 
This part contains the modifications for the syscall entry code and the
adjustment of the KVM code:

 - Move the non entry related ASM code into the regular text section so it
   is not part of the protected section

 - Move the low level entry C code into the .noinstr.text section

 - Make the interaction with lockdep, RCU and tracing correct

 - Move the KVM guest_enter/exit() handling and make it strict vs. RCU and
   instrumentation protection. It's more or less the same problem as the
   syscall entry/exit code and needs to be equally restrictive so that the
   rules can be enforced with objtool.

The objtool check for the noinstr.text correctness is not yet added to the
build machinery and has to be invoked manually for now:

   objtool check -fal vmlinux.o

The checking only works for built-in code as objtool cannot do a combined
analysis of vmlinux.o and a module.o.
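
For reference, one way to run it on a finished build (assuming an in-tree
x86-64 build with CONFIG_STACK_VALIDATION=y, which leaves vmlinux.o in the
build root and the objtool binary under tools/objtool/):

   make -j$(nproc) vmlinux
   ./tools/objtool/objtool check -fal vmlinux.o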

Thanks,

	tglx

8<----------
 arch/x86/entry/Makefile                |    2 
 arch/x86/entry/common.c                |  173 ++++++++++++++++++++++++---------
 arch/x86/entry/entry_32.S              |   35 ++----
 arch/x86/entry/entry_64.S              |   16 +--
 arch/x86/entry/entry_64_compat.S       |   55 ++++------
 arch/x86/entry/thunk_64.S              |   45 +++++++-
 arch/x86/include/asm/hardirq.h         |    4 
 arch/x86/include/asm/irqflags.h        |    3 
 arch/x86/include/asm/kvm_host.h        |    8 +
 arch/x86/include/asm/nospec-branch.h   |    4 
 arch/x86/include/asm/paravirt.h        |    3 
 arch/x86/kernel/ftrace_64.S            |    2 
 arch/x86/kvm/svm/svm.c                 |   66 ++++++++++--
 arch/x86/kvm/svm/vmenter.S             |    2 
 arch/x86/kvm/vmx/ops.h                 |    4 
 arch/x86/kvm/vmx/vmenter.S             |    5 
 arch/x86/kvm/vmx/vmx.c                 |   78 +++++++++++---
 arch/x86/kvm/x86.c                     |    4 
 include/linux/context_tracking.h       |   27 +++--
 include/linux/context_tracking_state.h |    6 -
 kernel/context_tracking.c              |   14 +-
 lib/smp_processor_id.c                 |   10 -
 22 files changed, 391 insertions(+), 175 deletions(-)



* [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06 15:51   ` Peter Zijlstra
                     ` (3 more replies)
  2020-05-05 13:41 ` [patch V4 part 2 02/18] x86/entry/32: " Thomas Gleixner
                   ` (16 subsequent siblings)
  17 siblings, 4 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

All ASM code which is not part of the entry functionality can move out into
the .text section. No reason to keep it in the non-instrumentable entry
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_64.S   |   10 ++++++++++
 arch/x86/kernel/ftrace_64.S |    2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -279,6 +279,7 @@ SYM_CODE_END(entry_SYSCALL_64)
  * %rdi: prev task
  * %rsi: next task
  */
+.pushsection .text, "ax"
 SYM_CODE_START(__switch_to_asm)
 	UNWIND_HINT_FUNC
 	/*
@@ -322,6 +323,7 @@ SYM_CODE_START(__switch_to_asm)
 
 	jmp	__switch_to
 SYM_CODE_END(__switch_to_asm)
+.popsection
 
 /*
  * A newly forked process directly context switches into this address.
@@ -330,6 +332,7 @@ SYM_CODE_END(__switch_to_asm)
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
  */
+.pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
 	UNWIND_HINT_EMPTY
 	movq	%rax, %rdi
@@ -358,6 +361,7 @@ SYM_CODE_START(ret_from_fork)
 	movq	$0, RAX(%rsp)
 	jmp	2b
 SYM_CODE_END(ret_from_fork)
+.popsection
 
 /*
  * Build the entry stubs with some assembler magic.
@@ -1042,6 +1046,7 @@ idtentry simd_coprocessor_error		do_simd
 	 * Reload gs selector with exception handling
 	 * edi:  new selector
 	 */
+.pushsection .text, "ax"
 SYM_FUNC_START(native_load_gs_index)
 	FRAME_BEGIN
 	pushfq
@@ -1058,6 +1063,7 @@ SYM_FUNC_START(native_load_gs_index)
 	ret
 SYM_FUNC_END(native_load_gs_index)
 EXPORT_SYMBOL(native_load_gs_index)
+.popsection
 
 	_ASM_EXTABLE(.Lgs_change, .Lbad_gs)
 	.section .fixup, "ax"
@@ -1077,6 +1083,7 @@ SYM_CODE_END(.Lbad_gs)
 	.previous
 
 /* Call softirq on interrupt stack. Interrupts are off. */
+.pushsection .text, "ax"
 SYM_FUNC_START(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
@@ -1086,6 +1093,7 @@ SYM_FUNC_START(do_softirq_own_stack)
 	leaveq
 	ret
 SYM_FUNC_END(do_softirq_own_stack)
+.popsection
 
 #ifdef CONFIG_XEN_PV
 idtentry hypervisor_callback xen_do_hypervisor_callback has_error_code=0
@@ -1730,6 +1738,7 @@ SYM_CODE_START(ignore_sysret)
 SYM_CODE_END(ignore_sysret)
 #endif
 
+.pushsection .text, "ax"
 SYM_CODE_START(rewind_stack_do_exit)
 	UNWIND_HINT_FUNC
 	/* Prevent any naive code from trying to unwind to our caller. */
@@ -1741,3 +1750,4 @@ SYM_CODE_START(rewind_stack_do_exit)
 
 	call	do_exit
 SYM_CODE_END(rewind_stack_do_exit)
+.popsection
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -12,7 +12,7 @@
 #include <asm/frame.h>
 
 	.code64
-	.section .entry.text, "ax"
+	.section .text, "ax"
 
 #ifdef CONFIG_FRAME_POINTER
 /* Save parent and function stack frames (rip and rbp) */



* [patch V4 part 2 02/18] x86/entry/32: Move non entry code into .text section
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-07 13:15   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr Thomas Gleixner
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

All ASM code which is not part of the entry functionality can move out into
the .text section. No reason to keep it in the non-instrumentable entry
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_32.S |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -729,7 +729,8 @@
 /*
  * %eax: prev task
  * %edx: next task
- */
+*/
+.pushsection .text, "ax"
 SYM_CODE_START(__switch_to_asm)
 	/*
 	 * Save callee-saved registers
@@ -776,6 +777,7 @@ SYM_CODE_START(__switch_to_asm)
 
 	jmp	__switch_to
 SYM_CODE_END(__switch_to_asm)
+.popsection
 
 /*
  * The unwinder expects the last frame on the stack to always be at the same
@@ -784,6 +786,7 @@ SYM_CODE_END(__switch_to_asm)
  * asmlinkage function so its argument has to be pushed on the stack.  This
  * wrapper creates a proper "end of stack" frame header before the call.
  */
+.pushsection .text, "ax"
 SYM_FUNC_START(schedule_tail_wrapper)
 	FRAME_BEGIN
 
@@ -794,6 +797,8 @@ SYM_FUNC_START(schedule_tail_wrapper)
 	FRAME_END
 	ret
 SYM_FUNC_END(schedule_tail_wrapper)
+.popsection
+
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -801,6 +806,7 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * ebx: kernel thread func (NULL for user thread)
  * edi: kernel thread arg
  */
+.pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
 	call	schedule_tail_wrapper
 
@@ -825,6 +831,7 @@ SYM_CODE_START(ret_from_fork)
 	movl	$0, PT_EAX(%esp)
 	jmp	2b
 SYM_CODE_END(ret_from_fork)
+.popsection
 
 /*
  * Return to user mode is not as complex as all this looks,
@@ -1693,6 +1700,7 @@ SYM_CODE_START(general_protection)
 	jmp	common_exception
 SYM_CODE_END(general_protection)
 
+.pushsection .text, "ax"
 SYM_CODE_START(rewind_stack_do_exit)
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
@@ -1703,3 +1711,4 @@ SYM_CODE_START(rewind_stack_do_exit)
 	call	do_exit
 1:	jmp 1b
 SYM_CODE_END(rewind_stack_do_exit)
+.popsection



* [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 02/18] x86/entry/32: " Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-08  8:21   ` Masami Hiramatsu
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation Thomas Gleixner
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Both the callers in the low level ASM code and __context_tracking_exit()
which is invoked from enter_from_user_mode() via user_exit_irqoff() are
marked NOKPROBE. Allowing enter_from_user_mode() to be probed is
inconsistent at best.

Aside of that, while function tracing per se is safe, the function trace
entry/exit points can be used via BPF as well, which is not safe before
context tracking has reached CONTEXT_KERNEL and adjusted RCU.

Mark it noinstr which moves it into the instrumentation protected text
section and includes notrace.

Note, this needs further fixups in context tracking to ensure that the
full call chain is protected. Will be addressed in follow up changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,7 +41,7 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
-__visible inline void enter_from_user_mode(void)
+__visible inline noinstr void enter_from_user_mode(void)
 {
 	CT_WARN_ON(ct_state() != CONTEXT_USER);
 	user_exit_irqoff();



* [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-07 13:39   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code Thomas Gleixner
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Mark the various syscall entries with noinstr to protect them against
instrumentation and add the instr_begin()/instr_end() annotations to mark the
parts of the functions which are safe to call out into instrumentable code.
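
As a condensed illustration of the resulting shape (boiled down from the
do_syscall_64() hunk below; the 32-bit variants and the error paths follow
the same pattern):

    __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
    {
            enter_from_user_mode();         /* noinstr: context tracking/RCU first */
            instr_begin();                  /* instrumentable region starts here */

            local_irq_enable();
            /* ... regular syscall dispatch and __syscall_return_slowpath() ... */

            instr_end();                    /* instrumentable region ends here */
            exit_to_user_mode();            /* noinstr: last steps before returning */
    }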

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c |  135 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 90 insertions(+), 45 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,15 +41,26 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
-__visible inline noinstr void enter_from_user_mode(void)
+__visible noinstr void enter_from_user_mode(void)
 {
-	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	enum ctx_state state = ct_state();
+
 	user_exit_irqoff();
+
+	instr_begin();
+	CT_WARN_ON(state != CONTEXT_USER);
+	instr_end();
 }
 #else
 static inline void enter_from_user_mode(void) {}
 #endif
 
+static noinstr void exit_to_user_mode(void)
+{
+	user_enter_irqoff();
+	mds_user_clear_cpu_buffers();
+}
+
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
 {
 #ifdef CONFIG_X86_64
@@ -179,8 +190,7 @@ static void exit_to_usermode_loop(struct
 	}
 }
 
-/* Called with IRQs disabled. */
-__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
+static void __prepare_exit_to_usermode(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
@@ -219,10 +229,14 @@ static void exit_to_usermode_loop(struct
 	 */
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
+}
 
-	user_enter_irqoff();
-
-	mds_user_clear_cpu_buffers();
+__visible noinstr void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+	instr_begin();
+	__prepare_exit_to_usermode(regs);
+	instr_end();
+	exit_to_user_mode();
 }
 
 #define SYSCALL_EXIT_WORK_FLAGS				\
@@ -251,11 +265,7 @@ static void syscall_slow_exit_work(struc
 		tracehook_report_syscall_exit(regs, step);
 }
 
-/*
- * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
- * state such that we can immediately switch to user mode.
- */
-__visible inline void syscall_return_slowpath(struct pt_regs *regs)
+static void __syscall_return_slowpath(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
@@ -276,15 +286,29 @@ static void syscall_slow_exit_work(struc
 		syscall_slow_exit_work(regs, cached_flags);
 
 	local_irq_disable();
-	prepare_exit_to_usermode(regs);
+	__prepare_exit_to_usermode(regs);
+}
+
+/*
+ * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+__visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
+{
+	instr_begin();
+	__syscall_return_slowpath(regs);
+	instr_end();
+	exit_to_user_mode();
 }
 
 #ifdef CONFIG_X86_64
-__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
 	struct thread_info *ti;
 
 	enter_from_user_mode();
+	instr_begin();
+
 	local_irq_enable();
 	ti = current_thread_info();
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
@@ -301,8 +325,10 @@ static void syscall_slow_exit_work(struc
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
+	__syscall_return_slowpath(regs);
 
-	syscall_return_slowpath(regs);
+	instr_end();
+	exit_to_user_mode();
 }
 #endif
 
@@ -310,10 +336,10 @@ static void syscall_slow_exit_work(struc
 /*
  * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
  * all entry and exit work and returns with IRQs off.  This function is
- * extremely hot in workloads that use it, and it's usually called from
  * extremely hot in workloads that use it, and it's usually called from
  * do_fast_syscall_32, so forcibly inline it to improve performance.
  */
-static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
+static void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
@@ -337,27 +363,62 @@ static __always_inline void do_syscall_3
 		regs->ax = ia32_sys_call_table[nr](regs);
 	}
 
-	syscall_return_slowpath(regs);
+	__syscall_return_slowpath(regs);
 }
 
 /* Handles int $0x80 */
-__visible void do_int80_syscall_32(struct pt_regs *regs)
+__visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	enter_from_user_mode();
+	instr_begin();
+
 	local_irq_enable();
 	do_syscall_32_irqs_on(regs);
+
+	instr_end();
+	exit_to_user_mode();
+}
+
+static bool __do_fast_syscall_32(struct pt_regs *regs)
+{
+	int res;
+
+	/* Fetch EBP from where the vDSO stashed it. */
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		/*
+		 * Micro-optimization: the pointer we're following is
+		 * explicitly 32 bits, so it can't be out of range.
+		 */
+		res = __get_user(*(u32 *)&regs->bp,
+			 (u32 __user __force *)(unsigned long)(u32)regs->sp);
+	} else {
+		res = get_user(*(u32 *)&regs->bp,
+		       (u32 __user __force *)(unsigned long)(u32)regs->sp);
+	}
+
+	if (res) {
+		/* User code screwed up. */
+		regs->ax = -EFAULT;
+		local_irq_disable();
+		__prepare_exit_to_usermode(regs);
+		return false;
+	}
+
+	/* Now this is just like a normal syscall. */
+	do_syscall_32_irqs_on(regs);
+	return true;
 }
 
 /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible long do_fast_syscall_32(struct pt_regs *regs)
+__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 {
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
 	 * convention.  Adjust regs so it looks like we entered using int80.
 	 */
-
 	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
-		vdso_image_32.sym_int80_landing_pad;
+					vdso_image_32.sym_int80_landing_pad;
+	bool success;
 
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
@@ -367,33 +428,17 @@ static __always_inline void do_syscall_3
 	regs->ip = landing_pad;
 
 	enter_from_user_mode();
+	instr_begin();
 
 	local_irq_enable();
+	success = __do_fast_syscall_32(regs);
 
-	/* Fetch EBP from where the vDSO stashed it. */
-	if (
-#ifdef CONFIG_X86_64
-		/*
-		 * Micro-optimization: the pointer we're following is explicitly
-		 * 32 bits, so it can't be out of range.
-		 */
-		__get_user(*(u32 *)&regs->bp,
-			    (u32 __user __force *)(unsigned long)(u32)regs->sp)
-#else
-		get_user(*(u32 *)&regs->bp,
-			 (u32 __user __force *)(unsigned long)(u32)regs->sp)
-#endif
-		) {
-
-		/* User code screwed up. */
-		local_irq_disable();
-		regs->ax = -EFAULT;
-		prepare_exit_to_usermode(regs);
-		return 0;	/* Keep it simple: use IRET. */
-	}
+	instr_end();
+	exit_to_user_mode();
 
-	/* Now this is just like a normal syscall. */
-	do_syscall_32_irqs_on(regs);
+	/* If it failed, keep it simple: use IRET. */
+	if (!success)
+		return 0;
 
 #ifdef CONFIG_X86_64
 	/*



* [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (3 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-07 13:55   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode() Thomas Gleixner
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Now that the C entry points are safe, move the irq flags tracing code into
the entry helper:

    - Invoke lockdep before calling into context tracking

    - Use the safe trace_hardirqs_off_prepare() trace function after context
      tracking has established state and RCU is watching (see the condensed
      sketch below)
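
Condensed, the resulting ordering in enter_from_user_mode() (taken from the
CONFIG_CONTEXT_TRACKING hunk below, with the CT_WARN_ON() check omitted) is:

    __visible noinstr void enter_from_user_mode(void)
    {
            lockdep_hardirqs_off(CALLER_ADDR0);     /* 1) lockdep: IRQs are off */
            user_exit_irqoff();                     /* 2) context tracking, RCU is watching again */

            instr_begin();
            trace_hardirqs_off_prepare();           /* 3) tracing, now safe */
            instr_end();
    }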

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c          |   21 +++++++++++++++++++--
 arch/x86/entry/entry_32.S        |   12 ------------
 arch/x86/entry/entry_64.S        |    2 --
 arch/x86/entry/entry_64_compat.S |   18 ------------------
 4 files changed, 19 insertions(+), 34 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -40,19 +40,36 @@
 #include <trace/events/syscalls.h>
 
 #ifdef CONFIG_CONTEXT_TRACKING
-/* Called on entry from user mode with IRQs off. */
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall entry disables interrupts, but user mode is traced as interrupts
+ * enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
 __visible noinstr void enter_from_user_mode(void)
 {
 	enum ctx_state state = ct_state();
 
+	lockdep_hardirqs_off(CALLER_ADDR0);
 	user_exit_irqoff();
 
 	instr_begin();
 	CT_WARN_ON(state != CONTEXT_USER);
+	trace_hardirqs_off_prepare();
 	instr_end();
 }
 #else
-static inline void enter_from_user_mode(void) {}
+static __always_inline void enter_from_user_mode(void)
+{
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	instr_begin();
+	trace_hardirqs_off_prepare();
+	instr_end();
+}
 #endif
 
 static noinstr void exit_to_user_mode(void)
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -967,12 +967,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movl	%esp, %eax
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -1082,12 +1076,6 @@ SYM_FUNC_START(entry_INT80_32)
 
 	SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1	/* save rest */
 
-	/*
-	 * User mode is traced as though IRQs are on, and the interrupt gate
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movl	%esp, %eax
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,8 +167,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_h
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
-	TRACE_IRQS_OFF
-
 	/* IRQs are off. */
 	movq	%rax, %rdi
 	movq	%rsp, %rsi
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -129,12 +129,6 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -247,12 +241,6 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -403,12 +391,6 @@ SYM_CODE_START(entry_INT80_compat)
 	xorl	%r15d, %r15d		/* nospec   r15 */
 	cld
 
-	/*
-	 * User mode is traced as though IRQs are on, and the interrupt
-	 * gate turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_int80_syscall_32
 .Lsyscall_32_done:



* [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode()
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (4 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-08 23:57   ` Andy Lutomirski
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented Thomas Gleixner
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

This is another step towards more C-code and less convoluted ASM.

Similar to the entry path, invoke the tracer before context tracking, which
might turn off RCU, and invoke lockdep as the last step before going back to
user space. Annotate the code sections in exit_to_user_mode() accordingly
so objtool won't complain about the tracer invocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3: Convert it to noinstr

V2: New patch simplifying the conversion and addressing Alex' review
    comment of redundant tracing.
---
 arch/x86/entry/common.c          |   19 ++++++++++++++++++-
 arch/x86/entry/entry_32.S        |   12 ++++--------
 arch/x86/entry/entry_64.S        |    4 ----
 arch/x86/entry/entry_64_compat.S |   14 +++++---------
 4 files changed, 27 insertions(+), 22 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -72,10 +72,27 @@ static __always_inline void enter_from_u
 }
 #endif
 
-static noinstr void exit_to_user_mode(void)
+/**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall exit enables interrupts, but the kernel state is interrupts
+ * disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Clear CPU buffers if CPU is affected by MDS and the mitigation is on.
+ * 4) Tell lockdep that interrupts are enabled
+ */
+static __always_inline void exit_to_user_mode(void)
 {
+	instr_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instr_end();
+
 	user_enter_irqoff();
 	mds_user_clear_cpu_buffers();
+	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -817,8 +817,7 @@ SYM_CODE_START(ret_from_fork)
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
 	call    syscall_return_slowpath
-	STACKLEAK_ERASE
-	jmp     restore_all
+	jmp     .Lsyscall_32_done
 
 	/* kernel thread */
 1:	movl	%edi, %eax
@@ -862,7 +861,7 @@ SYM_CODE_START_LOCAL(ret_from_exception)
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
 	call	prepare_exit_to_usermode
-	jmp	restore_all
+	jmp	restore_all_switch_stack
 SYM_CODE_END(ret_from_exception)
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
@@ -975,8 +974,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
 
 	STACKLEAK_ERASE
 
-/* Opportunistic SYSEXIT */
-	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+	/* Opportunistic SYSEXIT */
 
 	/*
 	 * Setup entry stack - we keep the pointer in %eax and do the
@@ -1079,11 +1077,9 @@ SYM_FUNC_START(entry_INT80_32)
 	movl	%esp, %eax
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
-
 	STACKLEAK_ERASE
 
-restore_all:
-	TRACE_IRQS_ON
+restore_all_switch_stack:
 	SWITCH_TO_ENTRY_STACK
 	CHECK_AND_APPLY_ESPFIX
 
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -172,8 +172,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_h
 	movq	%rsp, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
-	TRACE_IRQS_ON			/* return enables interrupts */
-
 	/*
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.  If we're not,
@@ -343,7 +341,6 @@ SYM_CODE_START(ret_from_fork)
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
-	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 
 1:
@@ -621,7 +618,6 @@ SYM_CODE_START_LOCAL(common_interrupt)
 .Lretint_user:
 	mov	%rsp,%rdi
 	call	prepare_exit_to_usermode
-	TRACE_IRQS_ON
 
 SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 #ifdef CONFIG_DEBUG_ENTRY
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -132,8 +132,8 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 	jmp	sysret32_from_system_call
 
 .Lsysenter_fix_flags:
@@ -244,8 +244,8 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
 	/* Opportunistic SYSRET */
 sysret32_from_system_call:
@@ -254,7 +254,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
 	 * stack. So let's erase the thread stack right now.
 	 */
 	STACKLEAK_ERASE
-	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+
 	movq	RBX(%rsp), %rbx		/* pt_regs->rbx */
 	movq	RBP(%rsp), %rbp		/* pt_regs->rbp */
 	movq	EFLAGS(%rsp), %r11	/* pt_regs->flags (in r11) */
@@ -393,9 +393,5 @@ SYM_CODE_START(entry_INT80_compat)
 
 	movq	%rsp, %rdi
 	call	do_int80_syscall_32
-.Lsyscall_32_done:
-
-	/* Go back to user mode. */
-	TRACE_IRQS_ON
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)



* [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (5 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode() Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-08  8:23   ` Masami Hiramatsu
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section Thomas Gleixner
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

context tracking lacks a few protection mechanisms against instrumentation:

 - While the core functions are marked NOKPROBE, they lack protection
   against function tracing, which is required as the function entry/exit
   points can be utilized by BPF.

 - static functions invoked from the protected functions need to be marked
   as well, as they can be instrumented otherwise.

 - using plain inline allows the compiler to emit traceable and probe-able
   functions.

Fix this by marking the functions noinstr and converting the plain inlines
to __always_inline.
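
As a minimal stand-alone illustration of the third point above (plain
userspace C, not kernel code; the function names are made up for the
example): built with gcc -O0, maybe_outlined() survives as a real symbol in
the object file, i.e. something ftrace/kprobes/BPF could attach to if it were
kernel code, while the always_inline variant never does.

    #include <stdio.h>

    /* The compiler may emit this out of line, e.g. at -O0. */
    static inline int maybe_outlined(int x)
    {
            return x + 1;
    }

    /* Never emitted out of line; always folded into the caller. */
    static inline __attribute__((always_inline)) int always_folded(int x)
    {
            return x + 2;
    }

    int main(void)
    {
            printf("%d %d\n", maybe_outlined(1), always_folded(1));
            return 0;
    }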

The NOKPROBE_SYMBOL() annotations are removed as the .noinstr.text section
is already excluded from being probed.

Cures the following objtool warnings:

 vmlinux.o: warning: objtool: enter_from_user_mode()+0x34: call to __context_tracking_exit() leaves .noinstr.text section
 vmlinux.o: warning: objtool: prepare_exit_to_usermode()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: syscall_return_slowpath()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_syscall_64()+0x7f: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_int80_syscall_32()+0x3d: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_fast_syscall_32()+0x9c: call to __context_tracking_enter() leaves .noinstr.text section

and generates new ones...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/context_tracking.h       |    6 +++---
 include/linux/context_tracking_state.h |    6 +++---
 kernel/context_tracking.c              |   14 ++++++++------
 3 files changed, 14 insertions(+), 12 deletions(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -33,13 +33,13 @@ static inline void user_exit(void)
 }
 
 /* Called with interrupts disabled.  */
-static inline void user_enter_irqoff(void)
+static __always_inline void user_enter_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_USER);
 
 }
-static inline void user_exit_irqoff(void)
+static __always_inline void user_exit_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_USER);
@@ -75,7 +75,7 @@ static inline void exception_exit(enum c
  * is enabled.  If context tracking is disabled, returns
  * CONTEXT_DISABLED.  This should be used primarily for debugging.
  */
-static inline enum ctx_state ct_state(void)
+static __always_inline enum ctx_state ct_state(void)
 {
 	return context_tracking_enabled() ?
 		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -26,12 +26,12 @@ struct context_tracking {
 extern struct static_key_false context_tracking_key;
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 
-static inline bool context_tracking_enabled(void)
+static __always_inline bool context_tracking_enabled(void)
 {
 	return static_branch_unlikely(&context_tracking_key);
 }
 
-static inline bool context_tracking_enabled_cpu(int cpu)
+static __always_inline bool context_tracking_enabled_cpu(int cpu)
 {
 	return context_tracking_enabled() && per_cpu(context_tracking.active, cpu);
 }
@@ -41,7 +41,7 @@ static inline bool context_tracking_enab
 	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
 }
 
-static inline bool context_tracking_in_user(void)
+static __always_inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -31,7 +31,7 @@ EXPORT_SYMBOL_GPL(context_tracking_key);
 DEFINE_PER_CPU(struct context_tracking, context_tracking);
 EXPORT_SYMBOL_GPL(context_tracking);
 
-static bool context_tracking_recursion_enter(void)
+static noinstr bool context_tracking_recursion_enter(void)
 {
 	int recursion;
 
@@ -45,7 +45,7 @@ static bool context_tracking_recursion_e
 	return false;
 }
 
-static void context_tracking_recursion_exit(void)
+static __always_inline void context_tracking_recursion_exit(void)
 {
 	__this_cpu_dec(context_tracking.recursion);
 }
@@ -59,7 +59,7 @@ static void context_tracking_recursion_e
  * instructions to execute won't use any RCU read side critical section
  * because this function sets RCU in extended quiescent state.
  */
-void __context_tracking_enter(enum ctx_state state)
+void noinstr __context_tracking_enter(enum ctx_state state)
 {
 	/* Kernel threads aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);
@@ -77,8 +77,10 @@ void __context_tracking_enter(enum ctx_s
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				instr_begin();
 				trace_user_enter(0);
 				vtime_user_enter(current);
+				instr_end();
 			}
 			rcu_user_enter();
 		}
@@ -99,7 +101,6 @@ void __context_tracking_enter(enum ctx_s
 	}
 	context_tracking_recursion_exit();
 }
-NOKPROBE_SYMBOL(__context_tracking_enter);
 EXPORT_SYMBOL_GPL(__context_tracking_enter);
 
 void context_tracking_enter(enum ctx_state state)
@@ -142,7 +143,7 @@ NOKPROBE_SYMBOL(context_tracking_user_en
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void __context_tracking_exit(enum ctx_state state)
+void noinstr __context_tracking_exit(enum ctx_state state)
 {
 	if (!context_tracking_recursion_enter())
 		return;
@@ -155,15 +156,16 @@ void __context_tracking_exit(enum ctx_st
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				instr_begin();
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				instr_end();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
 	}
 	context_tracking_recursion_exit();
 }
-NOKPROBE_SYMBOL(__context_tracking_exit);
 EXPORT_SYMBOL_GPL(__context_tracking_exit);
 
 void context_tracking_exit(enum ctx_state state)



* [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (6 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline tip-bot2 for Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] lib/smp_processor_id: Move it into noinstr section tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 09/18] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline Thomas Gleixner
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

That code is already not traceable. Move it into the noinstr section so the
objtool section validation does not trigger.

Annotate the warning code as "safe". While it might not be safe under all
circumstances, getting the information out is important enough.

Should this ever trigger from the sensitive code which is shielded against
instrumentation, e.g. low level entry, then the printk is the least of the
worries.

Addresses the objtool warnings:
 vmlinux.o: warning: objtool: context_tracking_recursion_enter()+0x7: call to __this_cpu_preempt_check() leaves .noinstr.text section
 vmlinux.o: warning: objtool: __context_tracking_exit()+0x17: call to __this_cpu_preempt_check() leaves .noinstr.text section
 vmlinux.o: warning: objtool: __context_tracking_enter()+0x2a: call to __this_cpu_preempt_check() leaves .noinstr.text section

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 lib/smp_processor_id.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -8,7 +8,7 @@
 #include <linux/kprobes.h>
 #include <linux/sched.h>
 
-notrace static nokprobe_inline
+noinstr static
 unsigned int check_preemption_disabled(const char *what1, const char *what2)
 {
 	int this_cpu = raw_smp_processor_id();
@@ -37,6 +37,7 @@ unsigned int check_preemption_disabled(c
 	 */
 	preempt_disable_notrace();
 
+	instr_begin();
 	if (!printk_ratelimit())
 		goto out_enable;
 
@@ -45,6 +46,7 @@ unsigned int check_preemption_disabled(c
 
 	printk("caller is %pS\n", __builtin_return_address(0));
 	dump_stack();
+	instr_end();
 
 out_enable:
 	preempt_enable_no_resched_notrace();
@@ -52,16 +54,14 @@ unsigned int check_preemption_disabled(c
 	return this_cpu;
 }
 
-notrace unsigned int debug_smp_processor_id(void)
+noinstr unsigned int debug_smp_processor_id(void)
 {
 	return check_preemption_disabled("smp_processor_id", "");
 }
 EXPORT_SYMBOL(debug_smp_processor_id);
-NOKPROBE_SYMBOL(debug_smp_processor_id);
 
-notrace void __this_cpu_preempt_check(const char *op)
+noinstr void __this_cpu_preempt_check(const char *op)
 {
 	check_preemption_disabled("__this_cpu_", op);
 }
 EXPORT_SYMBOL(__this_cpu_preempt_check);
-NOKPROBE_SYMBOL(__this_cpu_preempt_check);



* [patch V4 part 2 09/18] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (7 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Prevent the compiler from uninlining and creating traceable/probe-able
functions as this is invoked _after_ context tracking switched to
CONTEXT_USER and RCU went idle.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/nospec-branch.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -262,7 +262,7 @@ DECLARE_STATIC_KEY_FALSE(mds_idle_clear)
  * combination with microcode which triggers a CPU buffer flush when the
  * instruction is executed.
  */
-static inline void mds_clear_cpu_buffers(void)
+static __always_inline void mds_clear_cpu_buffers(void)
 {
 	static const u16 ds = __KERNEL_DS;
 
@@ -283,7 +283,7 @@ static inline void mds_clear_cpu_buffers
  *
  * Clear CPU buffers if the corresponding static key is enabled
  */
-static inline void mds_user_clear_cpu_buffers(void)
+static __always_inline void mds_user_clear_cpu_buffers(void)
 {
 	if (static_branch_likely(&mds_user_clear))
 		mds_clear_cpu_buffers();



* [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (8 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 09/18] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-07 14:15   ` Alexandre Chartre
                     ` (2 more replies)
  2020-05-05 13:41 ` [patch V4 part 2 11/18] x86/entry/64: Mark ___preempt_schedule_notrace() thunk noinstr Thomas Gleixner
                   ` (7 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

The preempt_enable_notrace() ASM thunk is called from tracing, entry code,
RCU and other places which are already in, or are going to be in, the noinstr
section which protects sensitive code from being instrumented.

Calls out of these sections happen with interrupts disabled, which is
handled in C code, but the push regs, call, pop regs sequence can be
completely avoided in this case.

This is also a preparatory step for annotating the call from the thunk to
preempt_schedule_notrace() as safe from a noinstr section.
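
In C terms, the check added by check_if=1 is roughly equivalent to the sketch
below (illustrative only; the real check runs in the asm thunk before any
registers are saved, and uses SAVE_FLAGS() rather than native_save_fl()
directly, and the helper name is made up):

    static __always_inline void preempt_thunk_with_if_check(void)
    {
            if (!(native_save_fl() & X86_EFLAGS_IF))
                    return;                 /* interrupts disabled: skip the call */
            preempt_schedule_notrace();
    }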

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/thunk_64.S       |   27 +++++++++++++++++++++++----
 arch/x86/include/asm/irqflags.h |    3 +--
 arch/x86/include/asm/paravirt.h |    3 +--
 3 files changed, 25 insertions(+), 8 deletions(-)

--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -9,10 +9,28 @@
 #include "calling.h"
 #include <asm/asm.h>
 #include <asm/export.h>
+#include <asm/irqflags.h>
+
+.code64
 
 	/* rdi:	arg1 ... normal C conventions. rax is saved/restored. */
-	.macro THUNK name, func, put_ret_addr_in_rdi=0
+	.macro THUNK name, func, put_ret_addr_in_rdi=0, check_if=0
 SYM_FUNC_START_NOALIGN(\name)
+
+	.if \check_if
+	/*
+	 * Check for interrupts disabled right here. No point in
+	 * going all the way down
+	 */
+	pushq	%rax
+	SAVE_FLAGS(CLBR_RAX)
+	testl	$X86_EFLAGS_IF, %eax
+	popq	%rax
+	jnz	1f
+	ret
+1:
+	.endif
+
 	pushq %rbp
 	movq %rsp, %rbp
 
@@ -38,14 +56,15 @@ SYM_FUNC_END(\name)
 	.endm
 
 #ifdef CONFIG_TRACE_IRQFLAGS
-	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
-	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
+	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller, put_ret_addr_in_rdi=1
+	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller, put_ret_addr_in_rdi=1
 #endif
 
 #ifdef CONFIG_PREEMPTION
 	THUNK preempt_schedule_thunk, preempt_schedule
-	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
 	EXPORT_SYMBOL(preempt_schedule_thunk)
+
+	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace, check_if=1
 	EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
 #endif
 
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -127,9 +127,8 @@ static inline notrace unsigned long arch
 #define DISABLE_INTERRUPTS(x)	cli
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_DEBUG_ENTRY
+
 #define SAVE_FLAGS(x)		pushfq; popq %rax
-#endif
 
 #define SWAPGS	swapgs
 /*
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -907,14 +907,13 @@ extern void default_banner(void);
 		  ANNOTATE_RETPOLINE_SAFE;				\
 		  jmp PARA_INDIRECT(pv_ops+PV_CPU_usergs_sysret64);)
 
-#ifdef CONFIG_DEBUG_ENTRY
 #define SAVE_FLAGS(clobbers)                                        \
 	PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),			    \
 		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);        \
 		  ANNOTATE_RETPOLINE_SAFE;			    \
 		  call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);	    \
 		  PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
-#endif
+
 #endif /* CONFIG_PARAVIRT_XXL */
 #endif	/* CONFIG_X86_64 */
 



* [patch V4 part 2 11/18] x86/entry/64: Mark ___preempt_schedule_notrace() thunk noinstr
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean Thomas Gleixner
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Code calling this from noinstr sections, e.g. entry code, has interrupts
disabled, so the actual call into the scheduler code does not happen.

The objtool section check complains nevertheless, so mark the call "safe".

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/thunk_64.S |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -12,6 +12,7 @@
 #include <asm/irqflags.h>
 
 .code64
+.section .noinstr.text, "ax"
 
 	/* rdi:	arg1 ... normal C conventions. rax is saved/restored. */
 	.macro THUNK name, func, put_ret_addr_in_rdi=0, check_if=0
@@ -49,10 +50,24 @@ SYM_FUNC_START_NOALIGN(\name)
 	movq 8(%rbp), %rdi
 	.endif
 
+	/*
+	 * noinstr callers will have interrupts disabled and will thus
+	 * not get here. Annotate the call as objtool does not know about
+	 * this and would complain about leaving the noinstr section.
+	 */
+1:
+	.pushsection .discard.instr_begin
+	.long 1b - .
+	.popsection
+
 	call \func
+2:
+	.pushsection .discard.instr_end
+	.long 2b - .
+	.popsection
+
 	jmp  .L_restore
 SYM_FUNC_END(\name)
-	_ASM_NOKPROBE(\name)
 	.endm
 
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -82,6 +97,5 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
 	popq %rdi
 	popq %rbp
 	ret
-	_ASM_NOKPROBE(.L_restore)
 SYM_CODE_END(.L_restore)
 #endif



* [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 11/18] x86/entry/64: Mark ___preempt_schedule_notrace() thunk noinstr Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-09  0:11   ` Andy Lutomirski
  2020-05-19 19:58   ` [tip: x86/entry] x86/entry: " tip-bot2 for Peter Zijlstra
  2020-05-05 13:41 ` [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs Thomas Gleixner
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

Currently entry_64_compat is exempt from objtool, but with vmlinux
mode there is no hiding it.

Make the following changes to make it pass:

 - change entry_SYSENTER_compat to STT_NOTYPE; it's not a function
   and doesn't have function type stack setup.

 - mark all STT_NOTYPE symbols with UNWIND_HINT_EMPTY; so we do
   validate them and don't treat them as unreachable.

 - don't abuse RSP as a temp register, this confuses objtool
   mightily as it (rightfully) thinks we're doing unspeakable
   things to the stack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/Makefile          |    2 --
 arch/x86/entry/entry_64_compat.S |   25 ++++++++++++++++++++-----
 2 files changed, 20 insertions(+), 7 deletions(-)

--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -11,8 +11,6 @@ CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRA
 CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
 CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
 
-OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
-
 CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_32.o		+= $(call cc-option,-Wno-override-init,)
 obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -46,12 +46,14 @@
  * ebp  user stack
  * 0(%ebp) arg6
  */
-SYM_FUNC_START(entry_SYSENTER_compat)
+SYM_CODE_START(entry_SYSENTER_compat)
+	UNWIND_HINT_EMPTY
 	/* Interrupts are off on entry. */
 	SWAPGS
 
-	/* We are about to clobber %rsp anyway, clobbering here is OK */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+	pushq	%rax
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+	popq	%rax
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
@@ -104,6 +106,9 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	xorl	%r14d, %r14d		/* nospec   r14 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
+
+	UNWIND_HINT_REGS
+
 	cld
 
 	/*
@@ -141,7 +146,7 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	popfq
 	jmp	.Lsysenter_flags_fixed
 SYM_INNER_LABEL(__end_entry_SYSENTER_compat, SYM_L_GLOBAL)
-SYM_FUNC_END(entry_SYSENTER_compat)
+SYM_CODE_END(entry_SYSENTER_compat)
 
 /*
  * 32-bit SYSCALL entry.
@@ -191,6 +196,7 @@ SYM_FUNC_END(entry_SYSENTER_compat)
  * 0(%esp) arg6
  */
 SYM_CODE_START(entry_SYSCALL_compat)
+	UNWIND_HINT_EMPTY
 	/* Interrupts are off on entry. */
 	swapgs
 
@@ -241,6 +247,8 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
 
+	UNWIND_HINT_REGS
+
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -328,6 +336,7 @@ SYM_CODE_END(entry_SYSCALL_compat)
  * ebp  arg6
  */
 SYM_CODE_START(entry_INT80_compat)
+	UNWIND_HINT_EMPTY
 	/*
 	 * Interrupts are off on entry.
 	 */
@@ -349,8 +358,11 @@ SYM_CODE_START(entry_INT80_compat)
 
 	/* Need to switch before accessing the thread stack. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/* In the Xen PV case we already run on the thread stack. */
-	ALTERNATIVE "movq %rsp, %rdi", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
+	ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
+
+	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	pushq	6*8(%rdi)		/* regs->ss */
@@ -389,6 +401,9 @@ SYM_CODE_START(entry_INT80_compat)
 	xorl	%r14d, %r14d		/* nospec   r14 */
 	pushq   %r15                    /* pt_regs->r15 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
+
+	UNWIND_HINT_REGS
+
 	cld
 
 	movq	%rsp, %rdi



* [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06  7:42   ` Paolo Bonzini
  2020-05-09  0:14   ` Andy Lutomirski
  2020-05-05 13:41 ` [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit Thomas Gleixner
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Context tracking for KVM happens way too early in the vcpu_run()
code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
cannot use RCU and should also not be instrumented.

The current way of doing this covers way too much code. Move it closer to
the actual vmenter/exit code.
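
In other words, instead of wrapping most of vcpu_enter_guest() the
tracking sits directly around the low level run functions. For the SVM
side the result looks roughly like this (sketch only, see the actual
hunks below):

	guest_enter_irqoff();

	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);

	/* Restore host GS base / segments */

	guest_exit_irqoff();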

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/svm/svm.c |   16 ++++++++++++++++
 arch/x86/kvm/vmx/vmx.c |   10 ++++++++++
 arch/x86/kvm/x86.c     |    2 --
 3 files changed, 26 insertions(+), 2 deletions(-)

--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3330,6 +3330,14 @@ static void svm_vcpu_run(struct kvm_vcpu
 	 */
 	x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
 
+	/*
+	 * Tell context tracking that this CPU is about to enter guest
+	 * mode. This has to be after x86_spec_ctrl_set_guest() because
+	 * that can take locks (lockdep needs RCU) and calls into world and
+	 * some more.
+	 */
+	guest_enter_irqoff();
+
 	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
 
 #ifdef CONFIG_X86_64
@@ -3340,6 +3348,14 @@ static void svm_vcpu_run(struct kvm_vcpu
 	loadsegment(gs, svm->host.gs);
 #endif
 #endif
+	/*
+	 * Tell context tracking that this CPU is back.
+	 *
+	 * This needs to be done before the below as native_read_msr()
+	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
+	 * into world and some more.
+	 */
+	guest_exit_irqoff();
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6603,6 +6603,11 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	 */
 	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
 
+	/*
+	 * Tell context tracking that this CPU is about to enter guest mode.
+	 */
+	guest_enter_irqoff();
+
 	/* L1D Flush includes CPU buffer clear to mitigate MDS */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
 		vmx_l1d_flush(vcpu);
@@ -6618,6 +6623,11 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	vcpu->arch.cr2 = read_cr2();
 
 	/*
+	 * Tell context tracking that this CPU is back.
+	 */
+	guest_exit_irqoff();
+
+	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the
 	 * SPEC_CTRL MSR it may have left it on; save the value and
 	 * turn it off. This is much more efficient than blindly adding
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8404,7 +8404,6 @@ static int vcpu_enter_guest(struct kvm_v
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
-	guest_enter_irqoff();
 
 	fpregs_assert_state_consistent();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
@@ -8467,7 +8466,6 @@ static int vcpu_enter_guest(struct kvm_v
 	local_irq_disable();
 	kvm_after_interrupt(vcpu);
 
-	guest_exit_irqoff();
 	if (lapic_in_kernel(vcpu)) {
 		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
 		if (delta != S64_MIN) {


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06  7:55   ` Paolo Bonzini
  2020-05-05 13:41 ` [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on " Thomas Gleixner
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Add hardirq tracing to guest enter/exit functions in the same way as it
is done in the user mode enter/exit code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/vmx.c |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6604,9 +6604,19 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
 
 	/*
-	 * Tell context tracking that this CPU is about to enter guest mode.
+	 * VMENTER enables interrupts (host state), but the kernel state is
+	 * interrupts disabled when this is invoked. Also tell RCU about
+	 * it. This is the same logic as for exit_to_user_mode().
+	 *
+	 * 1) Trace interrupts on state
+	 * 2) Prepare lockdep with RCU on
+	 * 3) Invoke context tracking if enabled to adjust RCU state
+	 * 4) Tell lockdep that interrupts are enabled
 	 */
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	guest_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
 
 	/* L1D Flush includes CPU buffer clear to mitigate MDS */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
@@ -6623,9 +6633,20 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	vcpu->arch.cr2 = read_cr2();
 
 	/*
-	 * Tell context tracking that this CPU is back.
+	 * VMEXIT disables interrupts (host state), but tracing and lockdep
+	 * have them in state 'on'. Same as enter_from_user_mode().
+	 *
+	 * 1) Tell lockdep that interrupts are disabled
+	 * 2) Invoke context tracking if enabled to reactivate RCU
+	 * 3) Trace interrupts off state
+	 *
+	 * This needs to be done before the below as native_read_msr()
+	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
+	 * into world and some more.
 	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
 	guest_exit_irqoff();
+	trace_hardirqs_off_prepare();
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06  8:15   ` Paolo Bonzini
  2020-05-05 13:41 ` [patch V4 part 2 16/18] context_tracking: Make guest_enter/exit() .noinstr ready Thomas Gleixner
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Add hardirq tracing to guest enter/exit functions in the same way as it is
done in the user mode enter/exit code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/svm/svm.c |   30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3331,12 +3331,23 @@ static void svm_vcpu_run(struct kvm_vcpu
 	x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
 
 	/*
-	 * Tell context tracking that this CPU is about to enter guest
-	 * mode. This has to be after x86_spec_ctrl_set_guest() because
-	 * that can take locks (lockdep needs RCU) and calls into world and
-	 * some more.
+	 * VMENTER enables interrupts (host state), but the kernel state is
+	 * interrupts disabled when this is invoked. Also tell RCU about
+	 * it. This is the same logic as for exit_to_user_mode().
+	 *
+	 * 1) Trace interrupts on state
+	 * 2) Prepare lockdep with RCU on
+	 * 3) Invoke context tracking if enabled to adjust RCU state
+	 * 4) Tell lockdep that interrupts are enabled
+	 *
+	 * This has to be after x86_spec_ctrl_set_guest() because that can
+	 * take locks (lockdep needs RCU) and calls into world and some
+	 * more.
 	 */
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	guest_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
 
 	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
 
@@ -3348,14 +3359,23 @@ static void svm_vcpu_run(struct kvm_vcpu
 	loadsegment(gs, svm->host.gs);
 #endif
 #endif
+
 	/*
-	 * Tell context tracking that this CPU is back.
+	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
+	 * above), but tracing and lockdep have them in state 'on'. Same as
+	 * enter_from_user_mode().
+	 *
+	 * 1) Tell lockdep that interrupts are disabled
+	 * 2) Invoke context tracking if enabled to reactivate RCU
+	 * 3) Trace interrupts off state
 	 *
 	 * This needs to be done before the below as native_read_msr()
 	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 	 * into world and some more.
 	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
 	guest_exit_irqoff();
+	trace_hardirqs_off_prepare();
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V4 part 2 16/18] context_tracking: Make guest_enter/exit() .noinstr ready
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on " Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text Thomas Gleixner
  2020-05-05 13:41 ` [patch V4 part 2 18/18] x86/kvm/svm: " Thomas Gleixner
  17 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Force inlining of the helpers and mark the instrumentable parts
accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 include/linux/context_tracking.h |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -101,12 +101,14 @@ static inline void context_tracking_init
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 /* must be called with irqs disabled */
-static inline void guest_enter_irqoff(void)
+static __always_inline void guest_enter_irqoff(void)
 {
+	instr_begin();
 	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_enter(current);
 	else
 		current->flags |= PF_VCPU;
+	instr_end();
 
 	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_GUEST);
@@ -118,39 +120,48 @@ static inline void guest_enter_irqoff(vo
 	 * one time slice). Lets treat guest mode as quiescent state, just like
 	 * we do with user-mode execution.
 	 */
-	if (!context_tracking_enabled_this_cpu())
+	if (!context_tracking_enabled_this_cpu()) {
+		instr_begin();
 		rcu_virt_note_context_switch(smp_processor_id());
+		instr_end();
+	}
 }
 
-static inline void guest_exit_irqoff(void)
+static __always_inline void guest_exit_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_GUEST);
 
+	instr_begin();
 	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_exit(current);
 	else
 		current->flags &= ~PF_VCPU;
+	instr_end();
 }
 
 #else
-static inline void guest_enter_irqoff(void)
+static __always_inline void guest_enter_irqoff(void)
 {
 	/*
 	 * This is running in ioctl context so its safe
 	 * to assume that it's the stime pending cputime
 	 * to flush.
 	 */
+	instr_begin();
 	vtime_account_kernel(current);
 	current->flags |= PF_VCPU;
 	rcu_virt_note_context_switch(smp_processor_id());
+	instr_end();
 }
 
-static inline void guest_exit_irqoff(void)
+static __always_inline void guest_exit_irqoff(void)
 {
+	instr_begin();
 	/* Flush the guest cputime we spent on the guest */
 	vtime_account_kernel(current);
 	current->flags &= ~PF_VCPU;
+	instr_end();
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (15 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 16/18] context_tracking: Make guest_enter/exit() .noinstr ready Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06  8:17   ` Paolo Bonzini
  2020-05-05 13:41 ` [patch V4 part 2 18/18] x86/kvm/svm: " Thomas Gleixner
  17 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Move the functions which are inside the RCU off region into the
non-instrumentable text section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/hardirq.h  |    4 -
 arch/x86/include/asm/kvm_host.h |    8 +++
 arch/x86/kvm/vmx/ops.h          |    4 +
 arch/x86/kvm/vmx/vmenter.S      |    5 +
 arch/x86/kvm/vmx/vmx.c          |  105 ++++++++++++++++++++++------------------
 arch/x86/kvm/x86.c              |    2 
 6 files changed, 79 insertions(+), 49 deletions(-)

--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -67,12 +67,12 @@ static inline void kvm_set_cpu_l1tf_flus
 	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
 }
 
-static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
+static __always_inline void kvm_clear_cpu_l1tf_flush_l1d(void)
 {
 	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
 }
 
-static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
+static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
 {
 	return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
 }
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1601,7 +1601,15 @@ asmlinkage void kvm_spurious_fault(void)
 	insn "\n\t"							\
 	"jmp	668f \n\t"						\
 	"667: \n\t"							\
+	"1: \n\t"							\
+	".pushsection .discard.instr_begin \n\t"			\
+	".long 1b - . \n\t"						\
+	".popsection \n\t"						\
 	"call	kvm_spurious_fault \n\t"				\
+	"1: \n\t"							\
+	".pushsection .discard.instr_end \n\t"				\
+	".long 1b - . \n\t"						\
+	".popsection \n\t"						\
 	"668: \n\t"							\
 	_ASM_EXTABLE(666b, 667b)
 
--- a/arch/x86/kvm/vmx/ops.h
+++ b/arch/x86/kvm/vmx/ops.h
@@ -146,7 +146,9 @@ do {									\
 			  : : op1 : "cc" : error, fault);		\
 	return;								\
 error:									\
+	instr_begin();							\
 	insn##_error(error_args);					\
+	instr_end();							\
 	return;								\
 fault:									\
 	kvm_spurious_fault();						\
@@ -161,7 +163,9 @@ do {									\
 			  : : op1, op2 : "cc" : error, fault);		\
 	return;								\
 error:									\
+	instr_begin();							\
 	insn##_error(error_args);					\
+	instr_end();							\
 	return;								\
 fault:									\
 	kvm_spurious_fault();						\
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -27,7 +27,7 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
-	.text
+.section .noinstr.text, "ax"
 
 /**
  * vmx_vmenter - VM-Enter the current loaded VMCS
@@ -231,6 +231,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	jmp 1b
 SYM_FUNC_END(__vmx_vcpu_run)
 
+
+.section .text, "ax"
+
 /**
  * vmread_error_trampoline - Trampoline from inline asm to vmread_error()
  * @field:	VMCS field encoding that failed
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6000,7 +6000,7 @@ static int vmx_handle_exit(struct kvm_vc
  * information but as all relevant affected CPUs have 32KiB L1D cache size
  * there is no point in doing so.
  */
-static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
+static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
 	int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
@@ -6033,7 +6033,7 @@ static void vmx_l1d_flush(struct kvm_vcp
 	vcpu->stat.l1d_flush++;
 
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
-		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+		native_wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
 		return;
 	}
 
@@ -6514,7 +6514,7 @@ static void vmx_update_hv_timer(struct k
 	}
 }
 
-void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
+void noinstr vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
 {
 	if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
 		vmx->loaded_vmcs->host_state.rsp = host_rsp;
@@ -6524,6 +6524,61 @@ void vmx_update_host_rsp(struct vcpu_vmx
 
 bool __vmx_vcpu_run(struct vcpu_vmx *vmx, unsigned long *regs, bool launched);
 
+static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+					struct vcpu_vmx *vmx)
+{
+	instr_begin();
+	/*
+	 * VMENTER enables interrupts (host state), but the kernel state is
+	 * interrupts disabled when this is invoked. Also tell RCU about
+	 * it. This is the same logic as for exit_to_user_mode().
+	 *
+	 * 1) Trace interrupts on state
+	 * 2) Prepare lockdep with RCU on
+	 * 3) Invoke context tracking if enabled to adjust RCU state
+	 * 4) Tell lockdep that interrupts are enabled
+	 */
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instr_end();
+
+	guest_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
+
+	/* L1D Flush includes CPU buffer clear to mitigate MDS */
+	if (static_branch_unlikely(&vmx_l1d_should_flush))
+		vmx_l1d_flush(vcpu);
+	else if (static_branch_unlikely(&mds_user_clear))
+		mds_clear_cpu_buffers();
+
+	if (vcpu->arch.cr2 != read_cr2())
+		write_cr2(vcpu->arch.cr2);
+
+	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
+				   vmx->loaded_vmcs->launched);
+
+	vcpu->arch.cr2 = read_cr2();
+
+	/*
+	 * VMEXIT disables interrupts (host state), but tracing and lockdep
+	 * have them in state 'on'. Same as enter_from_user_mode().
+	 *
+	 * 1) Tell lockdep that interrupts are disabled
+	 * 2) Invoke context tracking if enabled to reactivate RCU
+	 * 3) Trace interrupts off state
+	 *
+	 * This needs to be done before the below as native_read_msr()
+	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
+	 * into world and some more.
+	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	guest_exit_irqoff();
+
+	instr_begin();
+	trace_hardirqs_off_prepare();
+	instr_end();
+}
+
 static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6604,49 +6659,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
 
 	/*
-	 * VMENTER enables interrupts (host state), but the kernel state is
-	 * interrupts disabled when this is invoked. Also tell RCU about
-	 * it. This is the same logic as for exit_to_user_mode().
-	 *
-	 * 1) Trace interrupts on state
-	 * 2) Prepare lockdep with RCU on
-	 * 3) Invoke context tracking if enabled to adjust RCU state
-	 * 4) Tell lockdep that interrupts are enabled
+	 * The actual VMENTER/EXIT is in the .noinstr.text section.
 	 */
-	trace_hardirqs_on_prepare();
-	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	guest_enter_irqoff();
-	lockdep_hardirqs_on(CALLER_ADDR0);
-
-	/* L1D Flush includes CPU buffer clear to mitigate MDS */
-	if (static_branch_unlikely(&vmx_l1d_should_flush))
-		vmx_l1d_flush(vcpu);
-	else if (static_branch_unlikely(&mds_user_clear))
-		mds_clear_cpu_buffers();
-
-	if (vcpu->arch.cr2 != read_cr2())
-		write_cr2(vcpu->arch.cr2);
-
-	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
-				   vmx->loaded_vmcs->launched);
-
-	vcpu->arch.cr2 = read_cr2();
-
-	/*
-	 * VMEXIT disables interrupts (host state), but tracing and lockdep
-	 * have them in state 'on'. Same as enter_from_user_mode().
-	 *
-	 * 1) Tell lockdep that interrupts are disabled
-	 * 2) Invoke context tracking if enabled to reactivate RCU
-	 * 3) Trace interrupts off state
-	 *
-	 * This needs to be done before the below as native_read_msr()
-	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
-	 * into world and some more.
-	 */
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
-	trace_hardirqs_off_prepare();
+	vmx_vcpu_enter_exit(vcpu, vmx);
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -381,7 +381,7 @@ int kvm_set_apic_base(struct kvm_vcpu *v
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
-asmlinkage __visible void kvm_spurious_fault(void)
+asmlinkage __visible noinstr void kvm_spurious_fault(void)
 {
 	/* Fault while not rebooting.  We want the trace. */
 	BUG_ON(!kvm_rebooting);


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V4 part 2 18/18] x86/kvm/svm: Move guest enter/exit into .noinstr.text
  2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
                   ` (16 preceding siblings ...)
  2020-05-05 13:41 ` [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text Thomas Gleixner
@ 2020-05-05 13:41 ` Thomas Gleixner
  2020-05-06  8:17   ` Paolo Bonzini
  2020-05-07 14:47   ` Alexandre Chartre
  17 siblings, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:41 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Move the functions which are inside the RCU off region into the
non-instrumentable text section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/svm/svm.c     |  102 ++++++++++++++++++++++++---------------------
 arch/x86/kvm/svm/vmenter.S |    2 
 2 files changed, 57 insertions(+), 47 deletions(-)

--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3278,6 +3278,61 @@ static void svm_cancel_injection(struct
 
 void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
 
+static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+					struct vcpu_svm *svm)
+{
+	/*
+	 * VMENTER enables interrupts (host state), but the kernel state is
+	 * interrupts disabled when this is invoked. Also tell RCU about
+	 * it. This is the same logic as for exit_to_user_mode().
+	 *
+	 * 1) Trace interrupts on state
+	 * 2) Prepare lockdep with RCU on
+	 * 3) Invoke context tracking if enabled to adjust RCU state
+	 * 4) Tell lockdep that interrupts are enabled
+	 *
+	 * This has to be after x86_spec_ctrl_set_guest() because that can
+	 * take locks (lockdep needs RCU) and calls into world and some
+	 * more.
+	 */
+	instr_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instr_end();
+	guest_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
+
+	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
+
+#ifdef CONFIG_X86_64
+	native_wrmsrl(MSR_GS_BASE, svm->host.gs_base);
+#else
+	loadsegment(fs, svm->host.fs);
+#ifndef CONFIG_X86_32_LAZY_GS
+	loadsegment(gs, svm->host.gs);
+#endif
+#endif
+
+	/*
+	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
+	 * above), but tracing and lockdep have them in state 'on'. Same as
+	 * enter_from_user_mode().
+	 *
+	 * 1) Tell lockdep that interrupts are disabled
+	 * 2) Invoke context tracking if enabled to reactivate RCU
+	 * 3) Trace interrupts off state
+	 *
+	 * This needs to be done before the below as native_read_msr()
+	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
+	 * into world and some more.
+	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	guest_exit_irqoff();
+	instr_begin();
+	trace_hardirqs_off_prepare();
+	instr_end();
+}
+
 static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3330,52 +3385,7 @@ static void svm_vcpu_run(struct kvm_vcpu
 	 */
 	x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
 
-	/*
-	 * VMENTER enables interrupts (host state), but the kernel state is
-	 * interrupts disabled when this is invoked. Also tell RCU about
-	 * it. This is the same logic as for exit_to_user_mode().
-	 *
-	 * 1) Trace interrupts on state
-	 * 2) Prepare lockdep with RCU on
-	 * 3) Invoke context tracking if enabled to adjust RCU state
-	 * 4) Tell lockdep that interrupts are enabled
-	 *
-	 * This has to be after x86_spec_ctrl_set_guest() because that can
-	 * take locks (lockdep needs RCU) and calls into world and some
-	 * more.
-	 */
-	trace_hardirqs_on_prepare();
-	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	guest_enter_irqoff();
-	lockdep_hardirqs_on(CALLER_ADDR0);
-
-	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
-
-#ifdef CONFIG_X86_64
-	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
-#else
-	loadsegment(fs, svm->host.fs);
-#ifndef CONFIG_X86_32_LAZY_GS
-	loadsegment(gs, svm->host.gs);
-#endif
-#endif
-
-	/*
-	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
-	 * above), but tracing and lockdep have them in state 'on'. Same as
-	 * enter_from_user_mode().
-	 *
-	 * 1) Tell lockdep that interrupts are disabled
-	 * 2) Invoke context tracking if enabled to reactivate RCU
-	 * 3) Trace interrupts off state
-	 *
-	 * This needs to be done before the below as native_read_msr()
-	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
-	 * into world and some more.
-	 */
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
-	trace_hardirqs_off_prepare();
+	svm_vcpu_enter_exit(vcpu, svm);
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -27,7 +27,7 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
-	.text
+.section .noinstr.text, "ax"
 
 /**
  * __svm_vcpu_run - Run a vCPU via a transition to SVM guest mode


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs
  2020-05-05 13:41 ` [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs Thomas Gleixner
@ 2020-05-06  7:42   ` Paolo Bonzini
  2020-05-09  0:14   ` Andy Lutomirski
  1 sibling, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  7:42 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:41, Thomas Gleixner wrote:
> Context tracking for KVM happens way too early in the vcpu_run()
> code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
> cannot use RCU and should also not be instrumented.
> 
> The current way of doing this covers way too much code. Move it closer to
> the actual vmenter/exit code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/svm/svm.c |   16 ++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c |   10 ++++++++++
>  arch/x86/kvm/x86.c     |    2 --
>  3 files changed, 26 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3330,6 +3330,14 @@ static void svm_vcpu_run(struct kvm_vcpu
>  	 */
>  	x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
>  
> +	/*
> +	 * Tell context tracking that this CPU is about to enter guest
> +	 * mode. This has to be after x86_spec_ctrl_set_guest() because
> +	 * that can take locks (lockdep needs RCU) and calls into world and
> +	 * some more.
> +	 */
> +	guest_enter_irqoff();
> +
>  	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
>  
>  #ifdef CONFIG_X86_64
> @@ -3340,6 +3348,14 @@ static void svm_vcpu_run(struct kvm_vcpu
>  	loadsegment(gs, svm->host.gs);
>  #endif
>  #endif
> +	/*
> +	 * Tell context tracking that this CPU is back.
> +	 *
> +	 * This needs to be done before the below as native_read_msr()
> +	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> +	 * into world and some more.
> +	 */
> +	guest_exit_irqoff();
>  
>  	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6603,6 +6603,11 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	 */
>  	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
>  
> +	/*
> +	 * Tell context tracking that this CPU is about to enter guest mode.
> +	 */
> +	guest_enter_irqoff();
> +
>  	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>  		vmx_l1d_flush(vcpu);
> @@ -6618,6 +6623,11 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	vcpu->arch.cr2 = read_cr2();
>  
>  	/*
> +	 * Tell context tracking that this CPU is back.
> +	 */
> +	guest_exit_irqoff();
> +
> +	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
>  	 * SPEC_CTRL MSR it may have left it on; save the value and
>  	 * turn it off. This is much more efficient than blindly adding
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8404,7 +8404,6 @@ static int vcpu_enter_guest(struct kvm_v
>  	}
>  
>  	trace_kvm_entry(vcpu->vcpu_id);
> -	guest_enter_irqoff();
>  
>  	fpregs_assert_state_consistent();
>  	if (test_thread_flag(TIF_NEED_FPU_LOAD))
> @@ -8467,7 +8466,6 @@ static int vcpu_enter_guest(struct kvm_v
>  	local_irq_disable();
>  	kvm_after_interrupt(vcpu);
>  
> -	guest_exit_irqoff();
>  	if (lapic_in_kernel(vcpu)) {
>  		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
>  		if (delta != S64_MIN) {
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit
  2020-05-05 13:41 ` [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit Thomas Gleixner
@ 2020-05-06  7:55   ` Paolo Bonzini
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  7:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:41, Thomas Gleixner wrote:
> Add hardirq tracing to guest enter/exit functions in the same way as it
> is done in the user mode enter/exit code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c |   25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6604,9 +6604,19 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
>  
>  	/*
> -	 * Tell context tracking that this CPU is about to enter guest mode.
> +	 * VMENTER enables interrupts (host state), but the kernel state is
> +	 * interrupts disabled when this is invoked. Also tell RCU about
> +	 * it. This is the same logic as for exit_to_user_mode().
> +	 *
> +	 * 1) Trace interrupts on state
> +	 * 2) Prepare lockdep with RCU on
> +	 * 3) Invoke context tracking if enabled to adjust RCU state
> +	 * 4) Tell lockdep that interrupts are enabled
>  	 */
> +	trace_hardirqs_on_prepare();
> +	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>  	guest_enter_irqoff();
> +	lockdep_hardirqs_on(CALLER_ADDR0);
>  
>  	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> @@ -6623,9 +6633,20 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	vcpu->arch.cr2 = read_cr2();
>  
>  	/*
> -	 * Tell context tracking that this CPU is back.
> +	 * VMEXIT disables interrupts (host state), but tracing and lockdep
> +	 * have them in state 'on'. Same as enter_from_user_mode().
> +	 *
> +	 * 1) Tell lockdep that interrupts are disabled
> +	 * 2) Invoke context tracking if enabled to reactivate RCU
> +	 * 3) Trace interrupts off state
> +	 *
> +	 * This needs to be done before the below as native_read_msr()
> +	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> +	 * into world and some more.
>  	 */
> +	lockdep_hardirqs_off(CALLER_ADDR0);
>  	guest_exit_irqoff();
> +	trace_hardirqs_off_prepare();
>  
>  	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-05 13:41 ` [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on " Thomas Gleixner
@ 2020-05-06  8:15   ` Paolo Bonzini
  2020-05-06  8:48     ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  8:15 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:41, Thomas Gleixner wrote:
> +	 * VMENTER enables interrupts (host state), but the kernel state is
> +	 * interrupts disabled when this is invoked. Also tell RCU about
> +	 * it. This is the same logic as for exit_to_user_mode().
> +	 *
> +	 * 1) Trace interrupts on state
> +	 * 2) Prepare lockdep with RCU on
> +	 * 3) Invoke context tracking if enabled to adjust RCU state
> +	 * 4) Tell lockdep that interrupts are enabled
> +	 *
> +	 * This has to be after x86_spec_ctrl_set_guest() because that can
> +	 * take locks (lockdep needs RCU) and calls into world and some
> +	 * more.
>  	 */
> +	trace_hardirqs_on_prepare();
> +	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>  	guest_enter_irqoff();
> +	lockdep_hardirqs_on(CALLER_ADDR0);
>  
>  	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
>  
> @@ -3348,14 +3359,23 @@ static void svm_vcpu_run(struct kvm_vcpu
>  	loadsegment(gs, svm->host.gs);
>  #endif
>  #endif
> +
>  	/*
> -	 * Tell context tracking that this CPU is back.
> +	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
> +	 * above),

Apart from the small inaccuracy in that CLI has moved to vmenter.S, the
comments and commit message don't really help my understanding of why
this is needed.

It's true that interrupts cause a vmexit, and therefore from the
processor point of view it's as if they are enabled.  However, the
interrupt remains latched until local_irq_enable() in vcpu_enter_guest,
so from the point of view of the kernel interrupts are still disabled. I
don't understand why it's necessary to inform tracing and lockdep about
a processor-internal state that doesn't percolate up to the kernel.

For VMX indeed some care is necessary, because the interrupt is eaten
rather than latched.  Therefore, we call the interrupt handler from
handle_external_interrupt_irqoff while EFLAGS.IF is still clear.
However, if informing trace and lockdep turns out to be unnecessary
after all for SVM, it should be okay (and clearer) to place the code in
handle_external_interrupt_irqoff (also in arch/x86/kvm/vmx/vmx.c) .

Instead, if I'm wrong, the four steps in the comment above merely
restate what the code does, and the same goes for the three steps in
the comment below.  Can you replace them with the "why" of this change?

Thanks,

Paolo

> +      but tracing and lockdep have them in state 'on'. Same as
> +	 * enter_from_user_mode().
> +	 *
> +	 * 1) Tell lockdep that interrupts are disabled
> +	 * 2) Invoke context tracking if enabled to reactivate RCU
> +	 * 3) Trace interrupts off state
>  	 *
>  	 * This needs to be done before the below as native_read_msr()
>  	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
>  	 * into world and some more.
>  	 */
> +	lockdep_hardirqs_off(CALLER_ADDR0);
>  	guest_exit_irqoff();
> +	trace_hardirqs_off_prepare();
>  


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text
  2020-05-05 13:41 ` [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text Thomas Gleixner
@ 2020-05-06  8:17   ` Paolo Bonzini
  0 siblings, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  8:17 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:41, Thomas Gleixner wrote:
> Move the functions which are inside the RCU off region into the
> non-instrumentable text section.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/include/asm/hardirq.h  |    4 -
>  arch/x86/include/asm/kvm_host.h |    8 +++
>  arch/x86/kvm/vmx/ops.h          |    4 +
>  arch/x86/kvm/vmx/vmenter.S      |    5 +
>  arch/x86/kvm/vmx/vmx.c          |  105 ++++++++++++++++++++++------------------
>  arch/x86/kvm/x86.c              |    2 
>  6 files changed, 79 insertions(+), 49 deletions(-)
> 
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -67,12 +67,12 @@ static inline void kvm_set_cpu_l1tf_flus
>  	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
>  }
>  
> -static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
> +static __always_inline void kvm_clear_cpu_l1tf_flush_l1d(void)
>  {
>  	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
>  }
>  
> -static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
> +static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
>  {
>  	return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
>  }
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1601,7 +1601,15 @@ asmlinkage void kvm_spurious_fault(void)
>  	insn "\n\t"							\
>  	"jmp	668f \n\t"						\
>  	"667: \n\t"							\
> +	"1: \n\t"							\
> +	".pushsection .discard.instr_begin \n\t"			\
> +	".long 1b - . \n\t"						\
> +	".popsection \n\t"						\
>  	"call	kvm_spurious_fault \n\t"				\
> +	"1: \n\t"							\
> +	".pushsection .discard.instr_end \n\t"				\
> +	".long 1b - . \n\t"						\
> +	".popsection \n\t"						\
>  	"668: \n\t"							\
>  	_ASM_EXTABLE(666b, 667b)
>  
> --- a/arch/x86/kvm/vmx/ops.h
> +++ b/arch/x86/kvm/vmx/ops.h
> @@ -146,7 +146,9 @@ do {									\
>  			  : : op1 : "cc" : error, fault);		\
>  	return;								\
>  error:									\
> +	instr_begin();							\
>  	insn##_error(error_args);					\
> +	instr_end();							\
>  	return;								\
>  fault:									\
>  	kvm_spurious_fault();						\
> @@ -161,7 +163,9 @@ do {									\
>  			  : : op1, op2 : "cc" : error, fault);		\
>  	return;								\
>  error:									\
> +	instr_begin();							\
>  	insn##_error(error_args);					\
> +	instr_end();							\
>  	return;								\
>  fault:									\
>  	kvm_spurious_fault();						\
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -27,7 +27,7 @@
>  #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
>  #endif
>  
> -	.text
> +.section .noinstr.text, "ax"
>  
>  /**
>   * vmx_vmenter - VM-Enter the current loaded VMCS
> @@ -231,6 +231,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
>  	jmp 1b
>  SYM_FUNC_END(__vmx_vcpu_run)
>  
> +
> +.section .text, "ax"
> +
>  /**
>   * vmread_error_trampoline - Trampoline from inline asm to vmread_error()
>   * @field:	VMCS field encoding that failed
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6000,7 +6000,7 @@ static int vmx_handle_exit(struct kvm_vc
>   * information but as all relevant affected CPUs have 32KiB L1D cache size
>   * there is no point in doing so.
>   */
> -static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
> +static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>  {
>  	int size = PAGE_SIZE << L1D_CACHE_ORDER;
>  
> @@ -6033,7 +6033,7 @@ static void vmx_l1d_flush(struct kvm_vcp
>  	vcpu->stat.l1d_flush++;
>  
>  	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> -		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +		native_wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
>  		return;
>  	}
>  
> @@ -6514,7 +6514,7 @@ static void vmx_update_hv_timer(struct k
>  	}
>  }
>  
> -void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
> +void noinstr vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
>  {
>  	if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
>  		vmx->loaded_vmcs->host_state.rsp = host_rsp;
> @@ -6524,6 +6524,61 @@ void vmx_update_host_rsp(struct vcpu_vmx
>  
>  bool __vmx_vcpu_run(struct vcpu_vmx *vmx, unsigned long *regs, bool launched);
>  
> +static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> +					struct vcpu_vmx *vmx)
> +{
> +	instr_begin();
> +	/*
> +	 * VMENTER enables interrupts (host state), but the kernel state is
> +	 * interrupts disabled when this is invoked. Also tell RCU about
> +	 * it. This is the same logic as for exit_to_user_mode().
> +	 *
> +	 * 1) Trace interrupts on state
> +	 * 2) Prepare lockdep with RCU on
> +	 * 3) Invoke context tracking if enabled to adjust RCU state
> +	 * 4) Tell lockdep that interrupts are enabled
> +	 */
> +	trace_hardirqs_on_prepare();
> +	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> +	instr_end();
> +
> +	guest_enter_irqoff();
> +	lockdep_hardirqs_on(CALLER_ADDR0);
> +
> +	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> +	if (static_branch_unlikely(&vmx_l1d_should_flush))
> +		vmx_l1d_flush(vcpu);
> +	else if (static_branch_unlikely(&mds_user_clear))
> +		mds_clear_cpu_buffers();
> +
> +	if (vcpu->arch.cr2 != read_cr2())
> +		write_cr2(vcpu->arch.cr2);
> +
> +	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
> +				   vmx->loaded_vmcs->launched);
> +
> +	vcpu->arch.cr2 = read_cr2();
> +
> +	/*
> +	 * VMEXIT disables interrupts (host state), but tracing and lockdep
> +	 * have them in state 'on'. Same as enter_from_user_mode().
> +	 *
> +	 * 1) Tell lockdep that interrupts are disabled
> +	 * 2) Invoke context tracking if enabled to reactivate RCU
> +	 * 3) Trace interrupts off state
> +	 *
> +	 * This needs to be done before the below as native_read_msr()
> +	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> +	 * into world and some more.
> +	 */
> +	lockdep_hardirqs_off(CALLER_ADDR0);
> +	guest_exit_irqoff();
> +
> +	instr_begin();
> +	trace_hardirqs_off_prepare();
> +	instr_end();
> +}
> +
>  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -6604,49 +6659,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
>  
>  	/*
> -	 * VMENTER enables interrupts (host state), but the kernel state is
> -	 * interrupts disabled when this is invoked. Also tell RCU about
> -	 * it. This is the same logic as for exit_to_user_mode().
> -	 *
> -	 * 1) Trace interrupts on state
> -	 * 2) Prepare lockdep with RCU on
> -	 * 3) Invoke context tracking if enabled to adjust RCU state
> -	 * 4) Tell lockdep that interrupts are enabled
> +	 * The actual VMENTER/EXIT is in the .noinstr.text section.
>  	 */
> -	trace_hardirqs_on_prepare();
> -	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> -	guest_enter_irqoff();
> -	lockdep_hardirqs_on(CALLER_ADDR0);
> -
> -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> -	if (static_branch_unlikely(&vmx_l1d_should_flush))
> -		vmx_l1d_flush(vcpu);
> -	else if (static_branch_unlikely(&mds_user_clear))
> -		mds_clear_cpu_buffers();
> -
> -	if (vcpu->arch.cr2 != read_cr2())
> -		write_cr2(vcpu->arch.cr2);
> -
> -	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
> -				   vmx->loaded_vmcs->launched);
> -
> -	vcpu->arch.cr2 = read_cr2();
> -
> -	/*
> -	 * VMEXIT disables interrupts (host state), but tracing and lockdep
> -	 * have them in state 'on'. Same as enter_from_user_mode().
> -	 *
> -	 * 1) Tell lockdep that interrupts are disabled
> -	 * 2) Invoke context tracking if enabled to reactivate RCU
> -	 * 3) Trace interrupts off state
> -	 *
> -	 * This needs to be done before the below as native_read_msr()
> -	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> -	 * into world and some more.
> -	 */
> -	lockdep_hardirqs_off(CALLER_ADDR0);
> -	guest_exit_irqoff();
> -	trace_hardirqs_off_prepare();
> +	vmx_vcpu_enter_exit(vcpu, vmx);
>  
>  	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -381,7 +381,7 @@ int kvm_set_apic_base(struct kvm_vcpu *v
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_apic_base);
>  
> -asmlinkage __visible void kvm_spurious_fault(void)
> +asmlinkage __visible noinstr void kvm_spurious_fault(void)
>  {
>  	/* Fault while not rebooting.  We want the trace. */
>  	BUG_ON(!kvm_rebooting);
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 18/18] x86/kvm/svm: Move guest enter/exit into .noinstr.text
  2020-05-05 13:41 ` [patch V4 part 2 18/18] x86/kvm/svm: " Thomas Gleixner
@ 2020-05-06  8:17   ` Paolo Bonzini
  2020-05-07 14:47   ` Alexandre Chartre
  1 sibling, 0 replies; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  8:17 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:41, Thomas Gleixner wrote:
> Move the functions which are inside the RCU off region into the
> non-instrumentable text section.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/svm/svm.c     |  102 ++++++++++++++++++++++++---------------------
>  arch/x86/kvm/svm/vmenter.S |    2 
>  2 files changed, 57 insertions(+), 47 deletions(-)
> 
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3278,6 +3278,61 @@ static void svm_cancel_injection(struct
>  
>  void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
>  
> +static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> +					struct vcpu_svm *svm)
> +{
> +	/*
> +	 * VMENTER enables interrupts (host state), but the kernel state is
> +	 * interrupts disabled when this is invoked. Also tell RCU about
> +	 * it. This is the same logic as for exit_to_user_mode().
> +	 *
> +	 * 1) Trace interrupts on state
> +	 * 2) Prepare lockdep with RCU on
> +	 * 3) Invoke context tracking if enabled to adjust RCU state
> +	 * 4) Tell lockdep that interrupts are enabled
> +	 *
> +	 * This has to be after x86_spec_ctrl_set_guest() because that can
> +	 * take locks (lockdep needs RCU) and calls into world and some
> +	 * more.
> +	 */
> +	instr_begin();
> +	trace_hardirqs_on_prepare();
> +	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> +	instr_end();
> +	guest_enter_irqoff();
> +	lockdep_hardirqs_on(CALLER_ADDR0);
> +
> +	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
> +
> +#ifdef CONFIG_X86_64
> +	native_wrmsrl(MSR_GS_BASE, svm->host.gs_base);
> +#else
> +	loadsegment(fs, svm->host.fs);
> +#ifndef CONFIG_X86_32_LAZY_GS
> +	loadsegment(gs, svm->host.gs);
> +#endif
> +#endif
> +
> +	/*
> +	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
> +	 * above), but tracing and lockdep have them in state 'on'. Same as
> +	 * enter_from_user_mode().
> +	 *
> +	 * 1) Tell lockdep that interrupts are disabled
> +	 * 2) Invoke context tracking if enabled to reactivate RCU
> +	 * 3) Trace interrupts off state
> +	 *
> +	 * This needs to be done before the below as native_read_msr()
> +	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> +	 * into world and some more.
> +	 */
> +	lockdep_hardirqs_off(CALLER_ADDR0);
> +	guest_exit_irqoff();
> +	instr_begin();
> +	trace_hardirqs_off_prepare();
> +	instr_end();
> +}
> +
>  static void svm_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -3330,52 +3385,7 @@ static void svm_vcpu_run(struct kvm_vcpu
>  	 */
>  	x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
>  
> -	/*
> -	 * VMENTER enables interrupts (host state), but the kernel state is
> -	 * interrupts disabled when this is invoked. Also tell RCU about
> -	 * it. This is the same logic as for exit_to_user_mode().
> -	 *
> -	 * 1) Trace interrupts on state
> -	 * 2) Prepare lockdep with RCU on
> -	 * 3) Invoke context tracking if enabled to adjust RCU state
> -	 * 4) Tell lockdep that interrupts are enabled
> -	 *
> -	 * This has to be after x86_spec_ctrl_set_guest() because that can
> -	 * take locks (lockdep needs RCU) and calls into world and some
> -	 * more.
> -	 */
> -	trace_hardirqs_on_prepare();
> -	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> -	guest_enter_irqoff();
> -	lockdep_hardirqs_on(CALLER_ADDR0);
> -
> -	__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
> -
> -#ifdef CONFIG_X86_64
> -	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
> -#else
> -	loadsegment(fs, svm->host.fs);
> -#ifndef CONFIG_X86_32_LAZY_GS
> -	loadsegment(gs, svm->host.gs);
> -#endif
> -#endif
> -
> -	/*
> -	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
> -	 * above), but tracing and lockdep have them in state 'on'. Same as
> -	 * enter_from_user_mode().
> -	 *
> -	 * 1) Tell lockdep that interrupts are disabled
> -	 * 2) Invoke context tracking if enabled to reactivate RCU
> -	 * 3) Trace interrupts off state
> -	 *
> -	 * This needs to be done before the below as native_read_msr()
> -	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> -	 * into world and some more.
> -	 */
> -	lockdep_hardirqs_off(CALLER_ADDR0);
> -	guest_exit_irqoff();
> -	trace_hardirqs_off_prepare();
> +	svm_vcpu_enter_exit(vcpu, svm);
>  
>  	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
> --- a/arch/x86/kvm/svm/vmenter.S
> +++ b/arch/x86/kvm/svm/vmenter.S
> @@ -27,7 +27,7 @@
>  #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
>  #endif
>  
> -	.text
> +.section .noinstr.text, "ax"
>  
>  /**
>   * __svm_vcpu_run - Run a vCPU via a transition to SVM guest mode
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-06  8:15   ` Paolo Bonzini
@ 2020-05-06  8:48     ` Thomas Gleixner
  2020-05-06  9:21       ` Paolo Bonzini
  0 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-06  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 05/05/20 15:41, Thomas Gleixner wrote:
>>  	/*
>> -	 * Tell context tracking that this CPU is back.
>> +	 * VMEXIT disables interrupts (host state, see the CLI in the ASM
>> +	 * above),
>
> Apart from the small inaccuracy in that CLI has moved to vmenter.S, the

yes, that's a leftover from an earlier version.

> comments and commit message don't really help my understanding of why
> this is needed.
>
> It's true that interrupts cause a vmexit, and therefore from the
> processor point of view it's as if they are enabled.  However, the
> interrupt remains latched until local_irq_enable() in vcpu_enter_guest,
> so from the point of view of the kernel interrupts are still disabled. I
> don't understand why it's necessary to inform tracing and lockdep about
> a processor-internal state that doesn't percolate up to the kernel.
>
> For VMX indeed some care is necessary, because the interrupt is eaten
> rather than latched.  Therefore, we call the interrupt handler from
> handle_external_interrupt_irqoff while EFLAGS.IF is still clear.
> However, if informing trace and lockdep turns out to be unnecessary
> after all for SVM, it should be okay (and clearer) to place the code in
> handle_external_interrupt_irqoff (also in arch/x86/kvm/vmx/vmx.c) .
>
> Instead, if I'm wrong, the four steps in the comment above merely
> restate what the code does, and the same goes for the three steps in
> the comment below.  Can you replace them with the "why" of this change?

Sorry, yes the changelog and the comments are not really helpful.

From an instrumentation point of view, entering guest mode or returning
to user mode is more or less the same.

On return to user mode the kernel disables interrupts and the
sysret/iret reenables them. When entering the kernel from user mode via
syscall/exception, the entry code disables interrupts again. So for
instrumentation, especially interrupt disabled tracing, we must track
that change; otherwise a latency analysis would claim that interrupts
were disabled for the full time a task spent in user mode.

For guest mode this is practically the same. Before we enter the guest
the host state has to flip back to 'interrupts enabled', and on vmexit
it has to reestablish the interrupts disabled state. The reason for the
vmexit (interrupt, trapped access, halt) is irrelevant from a host state
perspective, so the tracking really needs to be right at the edge, like
we do for the user mode transitions.
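
Roughly, the ordering at the guest edges then mirrors the user mode
transitions (just a sketch of the calls which are already in the
patches, not the final comments):

	trace_hardirqs_on_prepare();
	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
	guest_enter_irqoff();			/* context tracking / RCU */
	lockdep_hardirqs_on(CALLER_ADDR0);

	/* VMENTER ... guest runs ... VMEXIT */

	lockdep_hardirqs_off(CALLER_ADDR0);
	guest_exit_irqoff();			/* context tracking / RCU */
	trace_hardirqs_off_prepare();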

I'll sit down and write up more coherent comments and changelog.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-06  8:48     ` Thomas Gleixner
@ 2020-05-06  9:21       ` Paolo Bonzini
  2020-05-07 14:44         ` [patch V5 " Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-06  9:21 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 06/05/20 10:48, Thomas Gleixner wrote:
> So for instrumentation, especially interrupt disabled tracing, we must
> track that change; otherwise a latency analysis would claim that
> interrupts were disabled for the full time a task spent in user
> mode.

Oh okay, that's clear now.  I would just replace the four bullets in the
first comment with this sentence, and remove altogether the three
bullets in the second comment.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
@ 2020-05-06 15:51   ` Peter Zijlstra
  2020-05-08  1:31   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 69+ messages in thread
From: Peter Zijlstra @ 2020-05-06 15:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, May 05, 2020 at 03:41:13PM +0200, Thomas Gleixner wrote:
> All ASM code which is not part of the entry functionality can move out into
> the .text section. No reason to keep it in the non-instrumentable entry
> section.

Just to note to self (or others), I'm planning to move all this into
arch/x86/kernel/kernel.S (bike-shed away) sometime after all this lands.
These things simply do not belong in entry.S at all.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 02/18] x86/entry/32: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 02/18] x86/entry/32: " Thomas Gleixner
@ 2020-05-07 13:15   ` Alexandre Chartre
  2020-05-07 14:14     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 69+ messages in thread
From: Alexandre Chartre @ 2020-05-07 13:15 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:41 PM, Thomas Gleixner wrote:
> All ASM code which is not part of the entry functionality can move out into
> the .text section. No reason to keep it in the non-instrumentable entry
> section.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/entry_32.S |   11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -729,7 +729,8 @@
>   /*
>    * %eax: prev task
>    * %edx: next task
> - */
> +*/

Misaligned comment end, this line shouldn't change.

alex.

> +.pushsection .text, "ax"
>   SYM_CODE_START(__switch_to_asm)
>   	/*
>   	 * Save callee-saved registers
> @@ -776,6 +777,7 @@ SYM_CODE_START(__switch_to_asm)
>   
>   	jmp	__switch_to
>   SYM_CODE_END(__switch_to_asm)
> +.popsection
>   
>   /*
>    * The unwinder expects the last frame on the stack to always be at the same
> @@ -784,6 +786,7 @@ SYM_CODE_END(__switch_to_asm)
>    * asmlinkage function so its argument has to be pushed on the stack.  This
>    * wrapper creates a proper "end of stack" frame header before the call.
>    */
> +.pushsection .text, "ax"
>   SYM_FUNC_START(schedule_tail_wrapper)
>   	FRAME_BEGIN
>   
> @@ -794,6 +797,8 @@ SYM_FUNC_START(schedule_tail_wrapper)
>   	FRAME_END
>   	ret
>   SYM_FUNC_END(schedule_tail_wrapper)
> +.popsection
> +
>   /*
>    * A newly forked process directly context switches into this address.
>    *
> @@ -801,6 +806,7 @@ SYM_FUNC_END(schedule_tail_wrapper)
>    * ebx: kernel thread func (NULL for user thread)
>    * edi: kernel thread arg
>    */
> +.pushsection .text, "ax"
>   SYM_CODE_START(ret_from_fork)
>   	call	schedule_tail_wrapper
>   
> @@ -825,6 +831,7 @@ SYM_CODE_START(ret_from_fork)
>   	movl	$0, PT_EAX(%esp)
>   	jmp	2b
>   SYM_CODE_END(ret_from_fork)
> +.popsection
>   
>   /*
>    * Return to user mode is not as complex as all this looks,
> @@ -1693,6 +1700,7 @@ SYM_CODE_START(general_protection)
>   	jmp	common_exception
>   SYM_CODE_END(general_protection)
>   
> +.pushsection .text, "ax"
>   SYM_CODE_START(rewind_stack_do_exit)
>   	/* Prevent any naive code from trying to unwind to our caller. */
>   	xorl	%ebp, %ebp
> @@ -1703,3 +1711,4 @@ SYM_CODE_START(rewind_stack_do_exit)
>   	call	do_exit
>   1:	jmp 1b
>   SYM_CODE_END(rewind_stack_do_exit)
> +.popsection
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation
  2020-05-05 13:41 ` [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation Thomas Gleixner
@ 2020-05-07 13:39   ` Alexandre Chartre
  2020-05-07 14:13     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 69+ messages in thread
From: Alexandre Chartre @ 2020-05-07 13:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:41 PM, Thomas Gleixner wrote:
> Mark the various syscall entries with noinstr to protect them against
> instrumentation and add the noinstr_begin()/end() annotations to mark the
> parts of the functions which are safe to call out into instrumentable code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/common.c |  135 ++++++++++++++++++++++++++++++++----------------
>   1 file changed, 90 insertions(+), 45 deletions(-)
> 
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -41,15 +41,26 @@
>   
>   #ifdef CONFIG_CONTEXT_TRACKING
>   /* Called on entry from user mode with IRQs off. */
> -__visible inline noinstr void enter_from_user_mode(void)
> +__visible noinstr void enter_from_user_mode(void)
>   {
> -	CT_WARN_ON(ct_state() != CONTEXT_USER);
> +	enum ctx_state state = ct_state();
> +
>   	user_exit_irqoff();
> +
> +	instr_begin();
> +	CT_WARN_ON(state != CONTEXT_USER);
> +	instr_end();
>   }
>   #else
>   static inline void enter_from_user_mode(void) {}
>   #endif
>   
> +static noinstr void exit_to_user_mode(void)
> +{
> +	user_enter_irqoff();
> +	mds_user_clear_cpu_buffers();
> +}
> +
>   static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>   {
>   #ifdef CONFIG_X86_64
> @@ -179,8 +190,7 @@ static void exit_to_usermode_loop(struct
>   	}
>   }
>   
> -/* Called with IRQs disabled. */
> -__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
> +static void __prepare_exit_to_usermode(struct pt_regs *regs)
>   {
>   	struct thread_info *ti = current_thread_info();
>   	u32 cached_flags;
> @@ -219,10 +229,14 @@ static void exit_to_usermode_loop(struct
>   	 */
>   	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
>   #endif
> +}
>   
> -	user_enter_irqoff();
> -
> -	mds_user_clear_cpu_buffers();
> +__visible noinstr void prepare_exit_to_usermode(struct pt_regs *regs)
> +{
> +	instr_begin();
> +	__prepare_exit_to_usermode(regs);
> +	instr_end();
> +	exit_to_user_mode();
>   }
>   
>   #define SYSCALL_EXIT_WORK_FLAGS				\
> @@ -251,11 +265,7 @@ static void syscall_slow_exit_work(struc
>   		tracehook_report_syscall_exit(regs, step);
>   }
>   
> -/*
> - * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
> - * state such that we can immediately switch to user mode.
> - */
> -__visible inline void syscall_return_slowpath(struct pt_regs *regs)
> +static void __syscall_return_slowpath(struct pt_regs *regs)
>   {
>   	struct thread_info *ti = current_thread_info();
>   	u32 cached_flags = READ_ONCE(ti->flags);
> @@ -276,15 +286,29 @@ static void syscall_slow_exit_work(struc
>   		syscall_slow_exit_work(regs, cached_flags);
>   
>   	local_irq_disable();
> -	prepare_exit_to_usermode(regs);
> +	__prepare_exit_to_usermode(regs);
> +}
> +
> +/*
> + * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
> + * state such that we can immediately switch to user mode.
> + */
> +__visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
> +{
> +	instr_begin();
> +	__syscall_return_slowpath(regs);
> +	instr_end();
> +	exit_to_user_mode();
>   }
>   
>   #ifdef CONFIG_X86_64
> -__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
> +__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>   {
>   	struct thread_info *ti;
>   
>   	enter_from_user_mode();
> +	instr_begin();
> +
>   	local_irq_enable();
>   	ti = current_thread_info();
>   	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
> @@ -301,8 +325,10 @@ static void syscall_slow_exit_work(struc
>   		regs->ax = x32_sys_call_table[nr](regs);
>   #endif
>   	}
> +	__syscall_return_slowpath(regs);
>   
> -	syscall_return_slowpath(regs);
> +	instr_end();
> +	exit_to_user_mode();
>   }
>   #endif
>   
> @@ -310,10 +336,10 @@ static void syscall_slow_exit_work(struc
>   /*
>    * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
>    * all entry and exit work and returns with IRQs off.  This function is
> - * extremely hot in workloads that use it, and it's usually called from
> + * ex2tremely hot in workloads that use it, and it's usually called from

typo: "ex2tremely"

alex.


>    * do_fast_syscall_32, so forcibly inline it to improve performance.
>    */
> -static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
> +static void do_syscall_32_irqs_on(struct pt_regs *regs)
>   {
>   	struct thread_info *ti = current_thread_info();
>   	unsigned int nr = (unsigned int)regs->orig_ax;
> @@ -337,27 +363,62 @@ static __always_inline void do_syscall_3
>   		regs->ax = ia32_sys_call_table[nr](regs);
>   	}
>   
> -	syscall_return_slowpath(regs);
> +	__syscall_return_slowpath(regs);
>   }
>   
>   /* Handles int $0x80 */
> -__visible void do_int80_syscall_32(struct pt_regs *regs)
> +__visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
>   {
>   	enter_from_user_mode();
> +	instr_begin();
> +
>   	local_irq_enable();
>   	do_syscall_32_irqs_on(regs);
> +
> +	instr_end();
> +	exit_to_user_mode();
> +}
> +
> +static bool __do_fast_syscall_32(struct pt_regs *regs)
> +{
> +	int res;
> +
> +	/* Fetch EBP from where the vDSO stashed it. */
> +	if (IS_ENABLED(CONFIG_X86_64)) {
> +		/*
> +		 * Micro-optimization: the pointer we're following is
> +		 * explicitly 32 bits, so it can't be out of range.
> +		 */
> +		res = __get_user(*(u32 *)&regs->bp,
> +			 (u32 __user __force *)(unsigned long)(u32)regs->sp);
> +	} else {
> +		res = get_user(*(u32 *)&regs->bp,
> +		       (u32 __user __force *)(unsigned long)(u32)regs->sp);
> +	}
> +
> +	if (res) {
> +		/* User code screwed up. */
> +		regs->ax = -EFAULT;
> +		local_irq_disable();
> +		__prepare_exit_to_usermode(regs);
> +		return false;
> +	}
> +
> +	/* Now this is just like a normal syscall. */
> +	do_syscall_32_irqs_on(regs);
> +	return true;
>   }
>   
>   /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
> -__visible long do_fast_syscall_32(struct pt_regs *regs)
> +__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
>   {
>   	/*
>   	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
>   	 * convention.  Adjust regs so it looks like we entered using int80.
>   	 */
> -
>   	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
> -		vdso_image_32.sym_int80_landing_pad;
> +					vdso_image_32.sym_int80_landing_pad;
> +	bool success;
>   
>   	/*
>   	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
> @@ -367,33 +428,17 @@ static __always_inline void do_syscall_3
>   	regs->ip = landing_pad;
>   
>   	enter_from_user_mode();
> +	instr_begin();
>   
>   	local_irq_enable();
> +	success = __do_fast_syscall_32(regs);
>   
> -	/* Fetch EBP from where the vDSO stashed it. */
> -	if (
> -#ifdef CONFIG_X86_64
> -		/*
> -		 * Micro-optimization: the pointer we're following is explicitly
> -		 * 32 bits, so it can't be out of range.
> -		 */
> -		__get_user(*(u32 *)&regs->bp,
> -			    (u32 __user __force *)(unsigned long)(u32)regs->sp)
> -#else
> -		get_user(*(u32 *)&regs->bp,
> -			 (u32 __user __force *)(unsigned long)(u32)regs->sp)
> -#endif
> -		) {
> -
> -		/* User code screwed up. */
> -		local_irq_disable();
> -		regs->ax = -EFAULT;
> -		prepare_exit_to_usermode(regs);
> -		return 0;	/* Keep it simple: use IRET. */
> -	}
> +	instr_end();
> +	exit_to_user_mode();
>   
> -	/* Now this is just like a normal syscall. */
> -	do_syscall_32_irqs_on(regs);
> +	/* If it failed, keep it simple: use IRET. */
> +	if (!success)
> +		return 0;
>   
>   #ifdef CONFIG_X86_64
>   	/*
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-05 13:41 ` [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code Thomas Gleixner
@ 2020-05-07 13:55   ` Alexandre Chartre
  2020-05-07 14:10     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 69+ messages in thread
From: Alexandre Chartre @ 2020-05-07 13:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:41 PM, Thomas Gleixner wrote:
> Now that the C entry points are safe, move the irq flags tracing code into
> the entry helper:
> 
>      - Invoke lockdep before calling into context tracking
> 
>      - Use the safe trace_hardirqs_on_prepare() trace function after context
>        tracking established state and RCU is watching.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/common.c          |   21 +++++++++++++++++++--
>   arch/x86/entry/entry_32.S        |   12 ------------
>   arch/x86/entry/entry_64.S        |    2 --
>   arch/x86/entry/entry_64_compat.S |   18 ------------------
>   4 files changed, 19 insertions(+), 34 deletions(-)
> 
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -40,19 +40,36 @@
>   #include <trace/events/syscalls.h>
>   
>   #ifdef CONFIG_CONTEXT_TRACKING
> -/* Called on entry from user mode with IRQs off. */
> +/**
> + * enter_from_user_mode - Establish state when coming from user mode
> + *
> + * Syscall entry disables interrupts, but user mode is traced as interrupts
> + * enabled. Also with NO_HZ_FULL RCU might be idle.
> + *
> + * 1) Tell lockdep that interrupts are disabled
> + * 2) Invoke context tracking if enabled to reactivate RCU
> + * 3) Trace interrupts off state
> + */
>   __visible noinstr void enter_from_user_mode(void)
>   {
>   	enum ctx_state state = ct_state();
>   
> +	lockdep_hardirqs_off(CALLER_ADDR0);
>   	user_exit_irqoff();
>   
>   	instr_begin();
>   	CT_WARN_ON(state != CONTEXT_USER);
> +	trace_hardirqs_off_prepare();
>   	instr_end();
>   }
>   #else
> -static inline void enter_from_user_mode(void) {}
> +static __always_inline void enter_from_user_mode(void)
> +{
> +	lockdep_hardirqs_off(CALLER_ADDR0);
> +	instr_begin();
> +	trace_hardirqs_off_prepare();
> +	instr_end();
> +}
>   #endif
>   
>   static noinstr void exit_to_user_mode(void)
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -967,12 +967,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
>   	jnz	.Lsysenter_fix_flags
>   .Lsysenter_flags_fixed:
>   
> -	/*
> -	 * User mode is traced as though IRQs are on, and SYSENTER
> -	 * turned them off.
> -	 */
> -	TRACE_IRQS_OFF
> -
>   	movl	%esp, %eax
>   	call	do_fast_syscall_32
>   	/* XEN PV guests always use IRET path */
> @@ -1082,12 +1076,6 @@ SYM_FUNC_START(entry_INT80_32)
>   
>   	SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1	/* save rest */
>   
> -	/*
> -	 * User mode is traced as though IRQs are on, and the interrupt gate
> -	 * turned them off.
> -	 */
> -	TRACE_IRQS_OFF
> -
>   	movl	%esp, %eax
>   	call	do_int80_syscall_32
>   .Lsyscall_32_done:
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -167,8 +167,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_h
>   
>   	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
>   
> -	TRACE_IRQS_OFF
> -
>   	/* IRQs are off. */
>   	movq	%rax, %rdi
>   	movq	%rsp, %rsi
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -129,12 +129,6 @@ SYM_FUNC_START(entry_SYSENTER_compat)
>   	jnz	.Lsysenter_fix_flags
>   .Lsysenter_flags_fixed:
>   
> -	/*
> -	 * User mode is traced as though IRQs are on, and SYSENTER
> -	 * turned them off.
> -	 */
> -	TRACE_IRQS_OFF
> -
>   	movq	%rsp, %rdi
>   	call	do_fast_syscall_32
>   	/* XEN PV guests always use IRET path */
> @@ -247,12 +241,6 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_aft
>   	pushq   $0			/* pt_regs->r15 = 0 */
>   	xorl	%r15d, %r15d		/* nospec   r15 */
>   
> -	/*
> -	 * User mode is traced as though IRQs are on, and SYSENTER
> -	 * turned them off.
> -	 */
> -	TRACE_IRQS_OFF
> -
>   	movq	%rsp, %rdi
>   	call	do_fast_syscall_32
>   	/* XEN PV guests always use IRET path */
> @@ -403,12 +391,6 @@ SYM_CODE_START(entry_INT80_compat)
>   	xorl	%r15d, %r15d		/* nospec   r15 */
>   	cld
>   
> -	/*
> -	 * User mode is traced as though IRQs are on, and the interrupt
> -	 * gate turned them off.
> -	 */
> -	TRACE_IRQS_OFF
> -
>   	movq	%rsp, %rdi
>   	call	do_int80_syscall_32
>   .Lsyscall_32_done:
> 

enter_from_user_mode() is also called with the CALL_enter_from_user_mode macro,
which is used in interrupt_entry() and identry. Don't you need to also remove
the TRACE_IRQS_OFF there now?

alex.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-07 13:55   ` Alexandre Chartre
@ 2020-05-07 14:10     ` Thomas Gleixner
  2020-05-07 15:03       ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 14:10 UTC (permalink / raw)
  To: Alexandre Chartre, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Alexandre Chartre <alexandre.chartre@oracle.com> writes:
> On 5/5/20 3:41 PM, Thomas Gleixner wrote:
>> -	/*
>> -	 * User mode is traced as though IRQs are on, and the interrupt
>> -	 * gate turned them off.
>> -	 */
>> -	TRACE_IRQS_OFF
>> -
>>   	movq	%rsp, %rdi
>>   	call	do_int80_syscall_32
>>   .Lsyscall_32_done:
>> 
>
> enter_from_user_mode() is also called with the CALL_enter_from_user_mode macro,
> which is used in interrupt_entry() and identry. Don't you need to also remove
> the TRACE_IRQS_OFF there now?

Hrm. right. OTOH, it's just redundant and should be no harm, but let me have a
look at that again.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation
  2020-05-07 13:39   ` Alexandre Chartre
@ 2020-05-07 14:13     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 14:13 UTC (permalink / raw)
  To: Alexandre Chartre, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Alexandre Chartre <alexandre.chartre@oracle.com> writes:
>> @@ -310,10 +336,10 @@ static void syscall_slow_exit_work(struc
>>   /*
>>    * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
>>    * all entry and exit work and returns with IRQs off.  This function is
>> - * extremely hot in workloads that use it, and it's usually called from
>> + * ex2tremely hot in workloads that use it, and it's usually called from
>
> typo: "ex2tremely"

Fixed. Btw, can you please trim your replies?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 02/18] x86/entry/32: Move non entry code into .text section
  2020-05-07 13:15   ` Alexandre Chartre
@ 2020-05-07 14:14     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 14:14 UTC (permalink / raw)
  To: Alexandre Chartre, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Alexandre Chartre <alexandre.chartre@oracle.com> writes:
> On 5/5/20 3:41 PM, Thomas Gleixner wrote:
>>   /*
>>    * %eax: prev task
>>    * %edx: next task
>> - */
>> +*/
>
> Misaligned comment end, this line shouldn't change.

Done.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
@ 2020-05-07 14:15   ` Alexandre Chartre
  2020-05-09  0:10   ` Andy Lutomirski
  2020-05-12  1:51   ` Steven Rostedt
  2 siblings, 0 replies; 69+ messages in thread
From: Alexandre Chartre @ 2020-05-07 14:15 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon



On 5/5/20 3:41 PM, Thomas Gleixner wrote:
> The preempt_enable_notrace() ASM thunk is called from tracing, entry code
> RCU and other places which are already in or going to be in the noinstr
> section which protects sensitve code from being instrumented.

typo: "sensitve"

alex.

> Calls out of these sections happen with interrupts disabled, which is
> handled in C code, but the push regs, call, pop regs sequence can be
> completely avoided in this case.
> 
> This is also a preparatory step for annotating the call from the thunk to
> preempt_enable_notrace() safe from a noinstr section.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/thunk_64.S       |   27 +++++++++++++++++++++++----
>   arch/x86/include/asm/irqflags.h |    3 +--
>   arch/x86/include/asm/paravirt.h |    3 +--
>   3 files changed, 25 insertions(+), 8 deletions(-)
> 
> --- a/arch/x86/entry/thunk_64.S
> +++ b/arch/x86/entry/thunk_64.S
> @@ -9,10 +9,28 @@
>   #include "calling.h"
>   #include <asm/asm.h>
>   #include <asm/export.h>
> +#include <asm/irqflags.h>
> +
> +.code64
>   
>   	/* rdi:	arg1 ... normal C conventions. rax is saved/restored. */
> -	.macro THUNK name, func, put_ret_addr_in_rdi=0
> +	.macro THUNK name, func, put_ret_addr_in_rdi=0, check_if=0
>   SYM_FUNC_START_NOALIGN(\name)
> +
> +	.if \check_if
> +	/*
> +	 * Check for interrupts disabled right here. No point in
> +	 * going all the way down
> +	 */
> +	pushq	%rax
> +	SAVE_FLAGS(CLBR_RAX)
> +	testl	$X86_EFLAGS_IF, %eax
> +	popq	%rax
> +	jnz	1f
> +	ret
> +1:
> +	.endif
> +
>   	pushq %rbp
>   	movq %rsp, %rbp
>   
> @@ -38,14 +56,15 @@ SYM_FUNC_END(\name)
>   	.endm
>   
>   #ifdef CONFIG_TRACE_IRQFLAGS
> -	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
> -	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
> +	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller, put_ret_addr_in_rdi=1
> +	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller, put_ret_addr_in_rdi=1
>   #endif
>   
>   #ifdef CONFIG_PREEMPTION
>   	THUNK preempt_schedule_thunk, preempt_schedule
> -	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
>   	EXPORT_SYMBOL(preempt_schedule_thunk)
> +
> +	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace, check_if=1
>   	EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
>   #endif
>   
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -127,9 +127,8 @@ static inline notrace unsigned long arch
>   #define DISABLE_INTERRUPTS(x)	cli
>   
>   #ifdef CONFIG_X86_64
> -#ifdef CONFIG_DEBUG_ENTRY
> +
>   #define SAVE_FLAGS(x)		pushfq; popq %rax
> -#endif
>   
>   #define SWAPGS	swapgs
>   /*
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -907,14 +907,13 @@ extern void default_banner(void);
>   		  ANNOTATE_RETPOLINE_SAFE;				\
>   		  jmp PARA_INDIRECT(pv_ops+PV_CPU_usergs_sysret64);)
>   
> -#ifdef CONFIG_DEBUG_ENTRY
>   #define SAVE_FLAGS(clobbers)                                        \
>   	PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),			    \
>   		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);        \
>   		  ANNOTATE_RETPOLINE_SAFE;			    \
>   		  call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);	    \
>   		  PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
> -#endif
> +
>   #endif /* CONFIG_PARAVIRT_XXL */
>   #endif	/* CONFIG_X86_64 */
>   
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch V5 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-06  9:21       ` Paolo Bonzini
@ 2020-05-07 14:44         ` Thomas Gleixner
  2020-05-08 13:45           ` Paolo Bonzini
  0 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 14:44 UTC (permalink / raw)
  To: Paolo Bonzini, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Entering guest mode is more or less the same as returning to user
space. From an instrumentation point of view both leave kernel mode and the
transition to guest or user mode reenables interrupts on the host. In user
mode an interrupt is served directly and in guest mode it causes a VM exit
which then handles or reinjects the interrupt.

The transition from guest mode or user mode to kernel mode disables
interrupts, which needs to be recorded in instrumentation to set the
correct state again.

This is important for e.g. latency analysis because otherwise the execution
time in guest or user mode would be wrongly accounted as interrupt disabled
and could trigger false positives.

Add hardirq tracing to guest enter/exit functions in the same way as it
is done in the user mode enter/exit code, respecting the RCU requirements.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
V5: Adjust comments and changelog
---
 arch/x86/kvm/vmx/vmx.c |   27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6604,9 +6604,21 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
 
 	/*
-	 * Tell context tracking that this CPU is about to enter guest mode.
+	 * VMENTER enables interrupts (host state), but the kernel state is
+	 * interrupts disabled when this is invoked. Also tell RCU about
+	 * it. This is the same logic as for exit_to_user_mode().
+	 *
+	 * This ensures that e.g. latency analysis on the host observes
+	 * guest mode as interrupt enabled.
+	 *
+	 * guest_enter_irqoff() informs context tracking about the
+	 * transition to guest mode and if enabled adjusts RCU state
+	 * accordingly.
 	 */
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	guest_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
 
 	/* L1D Flush includes CPU buffer clear to mitigate MDS */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
@@ -6623,9 +6635,20 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	vcpu->arch.cr2 = read_cr2();
 
 	/*
-	 * Tell context tracking that this CPU is back.
+	 * VMEXIT disables interrupts (host state), but tracing and lockdep
+	 * have them in state 'on' as recorded before entering guest mode.
+	 * Same as enter_from_user_mode().
+	 *
+	 * guest_exit_irqoff() restores host context and reinstates RCU if
+	 * enabled and required.
+	 *
+	 * This needs to be done before the below as native_read_msr()
+	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
+	 * into world and some more.
 	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
 	guest_exit_irqoff();
+	trace_hardirqs_off_prepare();
 
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 18/18] x86/kvm/svm: Move guest enter/exit into .noinstr.text
  2020-05-05 13:41 ` [patch V4 part 2 18/18] x86/kvm/svm: " Thomas Gleixner
  2020-05-06  8:17   ` Paolo Bonzini
@ 2020-05-07 14:47   ` Alexandre Chartre
  1 sibling, 0 replies; 69+ messages in thread
From: Alexandre Chartre @ 2020-05-07 14:47 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:41 PM, Thomas Gleixner wrote:
> Move the functions which are inside the RCU off region into the
> non-instrumentable text section.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>   arch/x86/kvm/svm/svm.c     |  102 ++++++++++++++++++++++++---------------------
>   arch/x86/kvm/svm/vmenter.S |    2
>   2 files changed, 57 insertions(+), 47 deletions(-)
> 

I have reviewed this series and only sent minor comments. So for
all patches of part 2:

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-07 14:10     ` Thomas Gleixner
@ 2020-05-07 15:03       ` Thomas Gleixner
  2020-05-07 17:06         ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 15:03 UTC (permalink / raw)
  To: Alexandre Chartre, LKML

Thomas Gleixner <tglx@linutronix.de> writes:
> Alexandre Chartre <alexandre.chartre@oracle.com> writes:
>> On 5/5/20 3:41 PM, Thomas Gleixner wrote:
>>> -	/*
>>> -	 * User mode is traced as though IRQs are on, and the interrupt
>>> -	 * gate turned them off.
>>> -	 */
>>> -	TRACE_IRQS_OFF
>>> -
>>>   	movq	%rsp, %rdi
>>>   	call	do_int80_syscall_32
>>>   .Lsyscall_32_done:
>>> 
>>
>> enter_from_user_mode() is also called with the CALL_enter_from_user_mode macro,
>> which is used in interrupt_entry() and identry. Don't you need to also remove
>> the TRACE_IRQS_OFF there now?
>
> Hrm. right. OTOH, it's just redundant and should be no harm, but let me have a
> look at that again.

Grr, no. It'll trigger the warnon when context tracking is enabled. /me
scratches head and goes to fix.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-07 15:03       ` Thomas Gleixner
@ 2020-05-07 17:06         ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-07 17:06 UTC (permalink / raw)
  To: Alexandre Chartre, LKML

Thomas Gleixner <tglx@linutronix.de> writes:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Alexandre Chartre <alexandre.chartre@oracle.com> writes:
>>> On 5/5/20 3:41 PM, Thomas Gleixner wrote:
>>>> -	/*
>>>> -	 * User mode is traced as though IRQs are on, and the interrupt
>>>> -	 * gate turned them off.
>>>> -	 */
>>>> -	TRACE_IRQS_OFF
>>>> -
>>>>   	movq	%rsp, %rdi
>>>>   	call	do_int80_syscall_32
>>>>   .Lsyscall_32_done:
>>>> 
>>>
>>> enter_from_user_mode() is also called with the CALL_enter_from_user_mode macro,
>>> which is used in interrupt_entry() and identry. Don't you need to also remove
>>> the TRACE_IRQS_OFF there now?
>>
>> Hrm. right. OTOH, it's just redundant and should be no harm, but let me have a
>> look at that again.
>
> Grr, no. It'll trigger the warnon when context tracking is enabled. /me
> scratches head and goes to fix.

Scratch that. After unfrying my brain by walking the dogs for an hour,
it's really just redundant calls into lockdep and tracing and both are
happy about it.

I could do a temporary function for that or just mention it in the
changelog.
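
To make the redundancy concrete, a rough sketch of the interrupt/idtentry
path as it looks after patch 05 (call sequence only, not actual code):

    /*
     * ASM (interrupt_entry / idtentry):
     *   TRACE_IRQS_OFF              <- lockdep/tracing see IRQs off (1st)
     *   CALL_enter_from_user_mode   <- calls enter_from_user_mode()
     *
     * C, enter_from_user_mode() as changed in patch 05:
     */
    lockdep_hardirqs_off(CALLER_ADDR0);   /* IRQs off recorded a 2nd time */
    user_exit_irqoff();
    instr_begin();
    trace_hardirqs_off_prepare();         /* likewise redundant, but both
                                             lockdep and tracing cope with
                                             the repetition */
    instr_end();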

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
  2020-05-06 15:51   ` Peter Zijlstra
@ 2020-05-08  1:31   ` Steven Rostedt
  2020-05-08 23:53   ` Andy Lutomirski
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  3 siblings, 0 replies; 69+ messages in thread
From: Steven Rostedt @ 2020-05-08  1:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On Tue, 05 May 2020 15:41:13 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> --- a/arch/x86/kernel/ftrace_64.S
> +++ b/arch/x86/kernel/ftrace_64.S
> @@ -12,7 +12,7 @@
>  #include <asm/frame.h>
>  
>  	.code64
> -	.section .entry.text, "ax"
> +	.section .text, "ax"
>  
>  #ifdef CONFIG_FRAME_POINTER
>  /* Save parent and function stack frames (rip and rbp) */

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr
  2020-05-05 13:41 ` [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr Thomas Gleixner
@ 2020-05-08  8:21   ` Masami Hiramatsu
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: Masami Hiramatsu @ 2020-05-08  8:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, 05 May 2020 15:41:15 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Both the callers in the low level ASM code and __context_tracking_exit()
> which is invoked from enter_from_user_mode() via user_exit_irqoff() are
> marked NOKPROBE. Allowing enter_from_user_mode() to be probed is
> inconsistent at best.
> 
> Aside of that while function tracing per se is safe the function trace
> entry/exit points can be used via BPF as well which is not safe to use
> before context tracking has reached CONTEXT_KERNEL and adjusted RCU.
> 
> Mark it noinstr which moves it into the instrumentation protected text
> section and includes notrace.
> 
> Note, this needs further fixups in context tracking to ensure that the
> full call chain is protected. Will be addressed in follow up changes.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Looks good to me.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you,

> ---
>  arch/x86/entry/common.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -41,7 +41,7 @@
>  
>  #ifdef CONFIG_CONTEXT_TRACKING
>  /* Called on entry from user mode with IRQs off. */
> -__visible inline void enter_from_user_mode(void)
> +__visible inline noinstr void enter_from_user_mode(void)
>  {
>  	CT_WARN_ON(ct_state() != CONTEXT_USER);
>  	user_exit_irqoff();
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented
  2020-05-05 13:41 ` [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented Thomas Gleixner
@ 2020-05-08  8:23   ` Masami Hiramatsu
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: Masami Hiramatsu @ 2020-05-08  8:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, 05 May 2020 15:41:19 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> context tracking lacks a few protection mechanisms against instrumentation:
> 
>  - While the core functions are marked NOKPROBE they lack protection
>    against function tracing which is required as the function entry/exit
>    points can be utilized by BPF.
> 
>  - static functions invoked from the protected functions need to be marked
>    as well as they can be instrumented otherwise.
> 
>  - using plain inline allows the compiler to emit traceable and probable
>    functions.
> 
> Fix this by marking the functions noinstr and converting the plain inlines
> to __always_inline.
> 
> The NOKPROBE_SYMBOL() annotations are removed as the .noinstr.text section
> is already excluded from being probed.
> 
> Cures the following objtool warnings:
> 
>  vmlinux.o: warning: objtool: enter_from_user_mode()+0x34: call to __context_tracking_exit() leaves .noinstr.text section
>  vmlinux.o: warning: objtool: prepare_exit_to_usermode()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
>  vmlinux.o: warning: objtool: syscall_return_slowpath()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
>  vmlinux.o: warning: objtool: do_syscall_64()+0x7f: call to __context_tracking_enter() leaves .noinstr.text section
>  vmlinux.o: warning: objtool: do_int80_syscall_32()+0x3d: call to __context_tracking_enter() leaves .noinstr.text section
>  vmlinux.o: warning: objtool: do_fast_syscall_32()+0x9c: call to __context_tracking_enter() leaves .noinstr.text section
> 
> and generates new ones...
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Looks good to me.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thanks!

> ---
>  include/linux/context_tracking.h       |    6 +++---
>  include/linux/context_tracking_state.h |    6 +++---
>  kernel/context_tracking.c              |   14 ++++++++------
>  3 files changed, 14 insertions(+), 12 deletions(-)
> 
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -33,13 +33,13 @@ static inline void user_exit(void)
>  }
>  
>  /* Called with interrupts disabled.  */
> -static inline void user_enter_irqoff(void)
> +static __always_inline void user_enter_irqoff(void)
>  {
>  	if (context_tracking_enabled())
>  		__context_tracking_enter(CONTEXT_USER);
>  
>  }
> -static inline void user_exit_irqoff(void)
> +static __always_inline void user_exit_irqoff(void)
>  {
>  	if (context_tracking_enabled())
>  		__context_tracking_exit(CONTEXT_USER);
> @@ -75,7 +75,7 @@ static inline void exception_exit(enum c
>   * is enabled.  If context tracking is disabled, returns
>   * CONTEXT_DISABLED.  This should be used primarily for debugging.
>   */
> -static inline enum ctx_state ct_state(void)
> +static __always_inline enum ctx_state ct_state(void)
>  {
>  	return context_tracking_enabled() ?
>  		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -26,12 +26,12 @@ struct context_tracking {
>  extern struct static_key_false context_tracking_key;
>  DECLARE_PER_CPU(struct context_tracking, context_tracking);
>  
> -static inline bool context_tracking_enabled(void)
> +static __always_inline bool context_tracking_enabled(void)
>  {
>  	return static_branch_unlikely(&context_tracking_key);
>  }
>  
> -static inline bool context_tracking_enabled_cpu(int cpu)
> +static __always_inline bool context_tracking_enabled_cpu(int cpu)
>  {
>  	return context_tracking_enabled() && per_cpu(context_tracking.active, cpu);
>  }
> @@ -41,7 +41,7 @@ static inline bool context_tracking_enab
>  	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
>  }
>  
> -static inline bool context_tracking_in_user(void)
> +static __always_inline bool context_tracking_in_user(void)
>  {
>  	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
>  }
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -31,7 +31,7 @@ EXPORT_SYMBOL_GPL(context_tracking_key);
>  DEFINE_PER_CPU(struct context_tracking, context_tracking);
>  EXPORT_SYMBOL_GPL(context_tracking);
>  
> -static bool context_tracking_recursion_enter(void)
> +static noinstr bool context_tracking_recursion_enter(void)
>  {
>  	int recursion;
>  
> @@ -45,7 +45,7 @@ static bool context_tracking_recursion_e
>  	return false;
>  }
>  
> -static void context_tracking_recursion_exit(void)
> +static __always_inline void context_tracking_recursion_exit(void)
>  {
>  	__this_cpu_dec(context_tracking.recursion);
>  }
> @@ -59,7 +59,7 @@ static void context_tracking_recursion_e
>   * instructions to execute won't use any RCU read side critical section
>   * because this function sets RCU in extended quiescent state.
>   */
> -void __context_tracking_enter(enum ctx_state state)
> +void noinstr __context_tracking_enter(enum ctx_state state)
>  {
>  	/* Kernel threads aren't supposed to go to userspace */
>  	WARN_ON_ONCE(!current->mm);
> @@ -77,8 +77,10 @@ void __context_tracking_enter(enum ctx_s
>  			 * on the tick.
>  			 */
>  			if (state == CONTEXT_USER) {
> +				instr_begin();
>  				trace_user_enter(0);
>  				vtime_user_enter(current);
> +				instr_end();
>  			}
>  			rcu_user_enter();
>  		}
> @@ -99,7 +101,6 @@ void __context_tracking_enter(enum ctx_s
>  	}
>  	context_tracking_recursion_exit();
>  }
> -NOKPROBE_SYMBOL(__context_tracking_enter);
>  EXPORT_SYMBOL_GPL(__context_tracking_enter);
>  
>  void context_tracking_enter(enum ctx_state state)
> @@ -142,7 +143,7 @@ NOKPROBE_SYMBOL(context_tracking_user_en
>   * This call supports re-entrancy. This way it can be called from any exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void __context_tracking_exit(enum ctx_state state)
> +void noinstr __context_tracking_exit(enum ctx_state state)
>  {
>  	if (!context_tracking_recursion_enter())
>  		return;
> @@ -155,15 +156,16 @@ void __context_tracking_exit(enum ctx_st
>  			 */
>  			rcu_user_exit();
>  			if (state == CONTEXT_USER) {
> +				instr_begin();
>  				vtime_user_exit(current);
>  				trace_user_exit(0);
> +				instr_end();
>  			}
>  		}
>  		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
>  	}
>  	context_tracking_recursion_exit();
>  }
> -NOKPROBE_SYMBOL(__context_tracking_exit);
>  EXPORT_SYMBOL_GPL(__context_tracking_exit);
>  
>  void context_tracking_exit(enum ctx_state state)
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V5 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-07 14:44         ` [patch V5 " Thomas Gleixner
@ 2020-05-08 13:45           ` Paolo Bonzini
  2020-05-08 14:01             ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Bonzini @ 2020-05-08 13:45 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 07/05/20 16:44, Thomas Gleixner wrote:
> Entering guest mode is more or less the same as returning to user
> space. From an instrumentation point of view both leave kernel mode and the
> transition to guest or user mode reenables interrupts on the host. In user
> mode an interrupt is served directly and in guest mode it causes a VM exit
> which then handles or reinjects the interrupt.
> 
> The transition from guest mode or user mode to kernel mode disables
> interrupts, which needs to be recorded in instrumentation to set the
> correct state again.
> 
> This is important for e.g. latency analysis because otherwise the execution
> time in guest or user mode would be wrongly accounted as interrupt disabled
> and could trigger false positives.
> 
> Add hardirq tracing to guest enter/exit functions in the same way as it
> is done in the user mode enter/exit code, respecting the RCU requirements.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> V5: Adjust comments and changelog

Apart from the subject being svm and not vmx, it looks great.  Thanks!

Paolo

> ---
>  arch/x86/kvm/vmx/vmx.c |   27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6604,9 +6604,21 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
>  
>  	/*
> -	 * Tell context tracking that this CPU is about to enter guest mode.
> +	 * VMENTER enables interrupts (host state), but the kernel state is
> +	 * interrupts disabled when this is invoked. Also tell RCU about
> +	 * it. This is the same logic as for exit_to_user_mode().
> +	 *
> +	 * This ensures that e.g. latency analysis on the host observes
> +	 * guest mode as interrupt enabled.
> +	 *
> +	 * guest_enter_irqoff() informs context tracking about the
> +	 * transition to guest mode and if enabled adjusts RCU state
> +	 * accordingly.
>  	 */
> +	trace_hardirqs_on_prepare();
> +	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>  	guest_enter_irqoff();
> +	lockdep_hardirqs_on(CALLER_ADDR0);
>  
>  	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> @@ -6623,9 +6635,20 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  	vcpu->arch.cr2 = read_cr2();
>  
>  	/*
> -	 * Tell context tracking that this CPU is back.
> +	 * VMEXIT disables interrupts (host state), but tracing and lockdep
> +	 * have them in state 'on' as recorded before entering guest mode.
> +	 * Same as enter_from_user_mode().
> +	 *
> +	 * guest_exit_irqoff() restores host context and reinstates RCU if
> +	 * enabled and required.
> +	 *
> +	 * This needs to be done before the below as native_read_msr()
> +	 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
> +	 * into world and some more.
>  	 */
> +	lockdep_hardirqs_off(CALLER_ADDR0);
>  	guest_exit_irqoff();
> +	trace_hardirqs_off_prepare();
>  
>  	/*
>  	 * We do not use IBRS in the kernel. If this vCPU has used the
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V5 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on guest enter/exit
  2020-05-08 13:45           ` Paolo Bonzini
@ 2020-05-08 14:01             ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-08 14:01 UTC (permalink / raw)
  To: Paolo Bonzini, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Paolo Bonzini <pbonzini@redhat.com> writes:
> On 07/05/20 16:44, Thomas Gleixner wrote:
>> Add hardirq tracing to guest enter/exit functions in the same way as it
>> is done in the user mode enter/exit code, respecting the RCU requirements.
>> 
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>> V5: Adjust comments and changelog
>
> Apart from the subject being svm and not vmx, it looks great.  Thanks!

Yeah, stupid me. I have the same change locally for SVM of course.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
  2020-05-06 15:51   ` Peter Zijlstra
  2020-05-08  1:31   ` Steven Rostedt
@ 2020-05-08 23:53   ` Andy Lutomirski
  2020-05-10 13:39     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  3 siblings, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2020-05-08 23:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> All ASM code which is not part of the entry functionality can move out into
> the .text section. No reason to keep it in the non-instrumentable entry
> section.

Ick.  How about just moving that code into another file altogether?

> +.pushsection .text, "ax"
>  SYM_FUNC_START(native_load_gs_index)
>         FRAME_BEGIN
>         pushfq
> @@ -1058,6 +1063,7 @@ SYM_FUNC_START(native_load_gs_index)
>         ret
>  SYM_FUNC_END(native_load_gs_index)
>  EXPORT_SYMBOL(native_load_gs_index)
> +.popsection

native_load_gs_index is toast if it gets instrumented in the wrong way.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode()
  2020-05-05 13:41 ` [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode() Thomas Gleixner
@ 2020-05-08 23:57   ` Andy Lutomirski
  2020-05-09 10:16     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2020-05-08 23:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This is another step towards more C-code and less convoluted ASM.
>
> Similar to the entry path, invoke the tracer before context tracking which
> might turn off RCU and invoke lockdep as the last step before going back to
> user space. Annotate the code sections in exit_to_user_mode() accordingly
> so objtool won't complain about the tracer invocation.

Acked-by: Andy Lutomirski <luto@kernel.org>

Note to self: the nmi code needs to be reworked to go through
prepare_exit_to_usermode(), too.  I'll do this once this whole pile
lands.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
  2020-05-07 14:15   ` Alexandre Chartre
@ 2020-05-09  0:10   ` Andy Lutomirski
  2020-05-09 10:25     ` Thomas Gleixner
  2020-05-12  1:48     ` Steven Rostedt
  2020-05-12  1:51   ` Steven Rostedt
  2 siblings, 2 replies; 69+ messages in thread
From: Andy Lutomirski @ 2020-05-09  0:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The preempt_enable_notrace() ASM thunk is called from tracing, entry code
> RCU and other places which are already in or going to be in the noinstr
> section which protects sensitve code from being instrumented.

This text and $SUBJECT agree that you're talking about
preempt_enable_notrace(), but:

> +       THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace, check_if=1

You actually seem to be changing preempt_schedule_notrace().

The actual code in question has this comment:

/**
 * preempt_schedule_notrace - preempt_schedule called by tracing
 *
 * The tracing infrastructure uses preempt_enable_notrace to prevent
 * recursion and tracing preempt enabling caused by the tracing
 * infrastructure itself. But as tracing can happen in areas coming
 * from userspace or just about to enter userspace, a preempt enable
 * can occur before user_exit() is called. This will cause the scheduler
 * to be called when the system is still in usermode.
 *
 * To prevent this, the preempt_enable_notrace will use this function
 * instead of preempt_schedule() to exit user context if needed before
 * calling the scheduler.
 */

Which is no longer really applicable to x86 -- in the state that this
comment nonsensically refers to as "userspace", x86 *always* has IRQs
off, which means that preempt_enable() will not schedule.

So I'm guessing that the issue you're solving is that we have
redundant preempt disable/enable pairs somewhere in the bowels of
tracing code that is called with IRQs off, and objtool is now
complaining.  Could the actual code in question be fixed to assert
that IRQs are off instead of disabling preemption?  If not, can you
fix the $SUBJECT and changelog and perhaps add a comment to the code
as to *why* you're checking IF?  Otherwise some intrepid programmer is
going to notice it down the road, wonder if it's optimizing anything
useful at all, and get rid of it.
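
For readers following the thread, a C-level sketch of what the check_if
variant of the thunk effectively does (illustrative only; the real code
is the ASM thunk in the patch, and native_save_fl() merely stands in for
the SAVE_FLAGS sequence there):

    /* Sketch: return immediately when interrupts are disabled, because
     * preemption cannot happen with IRQs off anyway, so the full
     * push regs / call / pop regs sequence would be pure overhead. */
    void preempt_schedule_notrace_thunk(void)
    {
            if (!(native_save_fl() & X86_EFLAGS_IF))
                    return;
            preempt_schedule_notrace();
    }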

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean
  2020-05-05 13:41 ` [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean Thomas Gleixner
@ 2020-05-09  0:11   ` Andy Lutomirski
  2020-05-09 10:06     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] x86/entry: " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2020-05-09  0:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Currently entry_64_compat is exempt from objtool, but with vmlinux
> mode there is no hiding it.
>
> Make the following changes to make it pass:
>
>  - change entry_SYSENTER_compat to STT_NOTYPE; it's not a function
>    and doesn't have function type stack setup.
>
>  - mark all STT_NOTYPE symbols with UNWIND_HINT_EMPTY; so we do
>    validate them and don't treat them as unreachable.
>
>  - don't abuse RSP as a temp register, this confuses objtool
>    mightily as it (rightfully) thinks we're doing unspeakable
>    things to the stack.
>

Acked-by: Andy Lutomirski <luto@kernel.org>

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Did a From line get eaten?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs
  2020-05-05 13:41 ` [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs Thomas Gleixner
  2020-05-06  7:42   ` Paolo Bonzini
@ 2020-05-09  0:14   ` Andy Lutomirski
  2020-05-09 10:12     ` Thomas Gleixner
  1 sibling, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2020-05-09  0:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Context tracking for KVM happens way too early in the vcpu_run()
> code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
> cannot use RCU and should also not be instrumented.
>
> The current way of doing this covers way too much code. Move it closer to
> the actual vmenter/exit code.

Now you've made me wonder what happens if someone traces
vmx_vcpu_run().  I'm not sure I really want to think about this.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean
  2020-05-09  0:11   ` Andy Lutomirski
@ 2020-05-09 10:06     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-09 10:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:

> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Currently entry_64_compat is exempt from objtool, but with vmlinux
>> mode there is no hiding it.
>>
>> Make the following changes to make it pass:
>>
>>  - change entry_SYSENTER_compat to STT_NOTYPE; it's not a function
>>    and doesn't have function type stack setup.
>>
>>  - mark all STT_NOTYPE symbols with UNWIND_HINT_EMPTY; so we do
>>    validate them and don't treat them as unreachable.
>>
>>  - don't abuse RSP as a temp register, this confuses objtool
>>    mightily as it (rightfully) thinks we're doing unspeakable
>>    things to the stack.
>>
>
> Acked-by: Andy Lutomirski <luto@kernel.org>
>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Did a From line get eaten?

Yes. A couple of patches which were just handed back and forth between
me and Peter lost them. Fixed them all up locally already.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs
  2020-05-09  0:14   ` Andy Lutomirski
@ 2020-05-09 10:12     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-09 10:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Context tracking for KVM happens way too early in the vcpu_run()
>> code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
>> cannot use RCU and should also not be instrumented.
>>
>> The current way of doing this covers way too much code. Move it closer to
>> the actual vmenter/exit code.
>
> Now you've made me wonder what happens if someone traces
> vmx_vcpu_run().  I'm not sure I really want to think about this.

Been there, done that. Kinda worked but adding a kprobe into the guts of
it made it go sideways very fast.
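
For illustration, the kind of arrangement the patch moves towards: a
minimal sketch, assuming the guest enter/exit window is split into its own
noinstr helper (an illustration only, not the actual KVM patch):

	/* Sketch: keep the critical guest enter/exit window noinstr. */
	static noinstr void vcpu_enter_exit_sketch(struct kvm_vcpu *vcpu)
	{
		guest_enter_irqoff();	/* last step before the VM entry */

		/* ... the actual vmenter/vmexit call ... */

		guest_exit_irqoff();	/* first step after the VM exit */
	}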

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode()
  2020-05-08 23:57   ` Andy Lutomirski
@ 2020-05-09 10:16     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-09 10:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> This is another step towards more C-code and less convoluted ASM.
>>
>> Similar to the entry path, invoke the tracer before context tracking which
>> might turn off RCU and invoke lockdep as the last step before going back to
>> user space. Annotate the code sections in exit_to_user_mode() accordingly
>> so objtool won't complain about the tracer invocation.
>
> Acked-by: Andy Lutomirski <luto@kernel.org>
>
> Note to self: the nmi code needs to be reworked to go through
> prepare_exit_to_usermode(), too.  I'll do this once this whole pile
> lands.

Why? NMI does not set any work stuff or preemption. If something needs
to be done then NMI raises irq_work which uses the regular path.
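
The irq_work pattern being referred to looks roughly like this (the names
are illustrative only, not taken from the entry code):

	#include <linux/irq_work.h>

	static void nmi_deferred_func(struct irq_work *work)
	{
		/* Runs later from regular interrupt context, not in NMI. */
	}

	static DEFINE_IRQ_WORK(nmi_deferred, nmi_deferred_func);

	/* From NMI context: just queue the work, it is processed via the
	 * regular interrupt path. */
	static void example_nmi_handler(void)
	{
		irq_work_queue(&nmi_deferred);
	}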

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-09  0:10   ` Andy Lutomirski
@ 2020-05-09 10:25     ` Thomas Gleixner
  2020-05-10 18:47       ` Thomas Gleixner
  2020-05-12  1:48     ` Steven Rostedt
  1 sibling, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-09 10:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> The preempt_enable_notrace() ASM thunk is called from tracing, entry code
>> RCU and other places which are already in or going to be in the noinstr
>> section which protects sensitve code from being instrumented.
>
> This text and $SUBJECT agree that you're talking about
> preempt_enable_notrace(), but:
>
>> +       THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace, check_if=1
>
> You actually seem to be changing preempt_schedule_notrace().

Duh, yes.

> The actual code in question has this comment:
>
> /**
>  * preempt_schedule_notrace - preempt_schedule called by tracing
>  *
>  * The tracing infrastructure uses preempt_enable_notrace to prevent
>  * recursion and tracing preempt enabling caused by the tracing
>  * infrastructure itself. But as tracing can happen in areas coming
>  * from userspace or just about to enter userspace, a preempt enable
>  * can occur before user_exit() is called. This will cause the scheduler
>  * to be called when the system is still in usermode.
>  *
>  * To prevent this, the preempt_enable_notrace will use this function
>  * instead of preempt_schedule() to exit user context if needed before
>  * calling the scheduler.
>  */
>
> Which is no longer really applicable to x86 -- in the state that this
> comment nonsensically refers to as "userspace", x86 *always* has IRQs
> off, which means that preempt_enable() will not schedule.
>
> So I'm guessing that the issue you're solving is that we have
> redundant preempt disable/enable pairs somewhere in the bowels of
> tracing code that is called with IRQs off, and objtool is now
> complaining.  Could the actual code in question be fixed to assert
> that IRQs are off instead of disabling preemption?  If not, can you
> fix the $SUBJECT and changelog and perhaps add a comment to the code
> as to *why* you're checking IF?  Otherwise some intrepid programmer is
> going to notice it down the road, wonder if it's optimizing anything
> useful at all, and get rid of it.

Let me stare into that again.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section
  2020-05-08 23:53   ` Andy Lutomirski
@ 2020-05-10 13:39     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-10 13:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> All ASM code which is not part of the entry functionality can move out into
>> the .text section. No reason to keep it in the non-instrumentable entry
>> section.
>
> Ick.  How about just moving that code into another file altogether?

Peter wanted to do that separately.

>> +.pushsection .text, "ax"
>>  SYM_FUNC_START(native_load_gs_index)
>>         FRAME_BEGIN
>>         pushfq
>> @@ -1058,6 +1063,7 @@ SYM_FUNC_START(native_load_gs_index)
>>         ret
>>  SYM_FUNC_END(native_load_gs_index)
>>  EXPORT_SYMBOL(native_load_gs_index)
>> +.popsection
>
> native_load_gs_index is toast if it gets instrumented in the wrong way.

I'll keep it in the noinstr section then.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-09 10:25     ` Thomas Gleixner
@ 2020-05-10 18:47       ` Thomas Gleixner
  2020-05-11 18:27         ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-10 18:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Thomas Gleixner <tglx@linutronix.de> writes:
> Andy Lutomirski <luto@kernel.org> writes:
>> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> /**
>>  * preempt_schedule_notrace - preempt_schedule called by tracing
>>  *
>>  * The tracing infrastructure uses preempt_enable_notrace to prevent
>>  * recursion and tracing preempt enabling caused by the tracing
>>  * infrastructure itself. But as tracing can happen in areas coming
>>  * from userspace or just about to enter userspace, a preempt enable
>>  * can occur before user_exit() is called. This will cause the scheduler
>>  * to be called when the system is still in usermode.
>>  *
>>  * To prevent this, the preempt_enable_notrace will use this function
>>  * instead of preempt_schedule() to exit user context if needed before
>>  * calling the scheduler.
>>  */
>>
>> Which is no longer really applicable to x86 -- in the state that this
>> comment nonsensically refers to as "userspace", x86 *always* has IRQs
>> off, which means that preempt_enable() will not schedule.

Yeah.

>> So I'm guessing that the issue you're solving is that we have
>> redundant preempt disable/enable pairs somewhere in the bowels of
>> tracing code that is called with IRQs off, and objtool is now
>> complaining.  Could the actual code in question be fixed to assert
>> that IRQs are off instead of disabling preemption?  If not, can you
>> fix the $SUBJECT and changelog and perhaps add a comment to the code
>> as to *why* you're checking IF?  Otherwise some intrepid programmer is
>> going to notice it down the road, wonder if it's optimizing anything
>> useful at all, and get rid of it.
>
> Let me stare into that again.

There are a few preempt_disable/enable() pairs in some of the helper
functions which are called in various places. That means we would have
to chase all of them and provide 'naked' helpers for these particular
call chains. I'll fix the changelog and add a comment to make clear what
this is about.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-10 18:47       ` Thomas Gleixner
@ 2020-05-11 18:27         ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-11 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Thomas Gleixner <tglx@linutronix.de> writes:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Let me stare into that again.
>
> There are a few preempt_disable/enable() pairs in some of the helper
> functions which are called in various places. That means we would have
> to chase all of them and provide 'naked' helpers for these particular
> call chains. I'll fix the changelog and add a comment to make clear what
> this is about.

I actually sat down and chased it. It's mostly the tracing code - again,
particularly the hardware latency tracer. There is really no point in
invoking that from the guts of nmi_enter() and nmi_exit().

There is no reason to invoke it at all for #DB, #BP or #MCE. If someone
does hardware latency analysis then #DB and #BP should not be in use at
all. If so, shrug. If #MCE hits, then the hardware-induced latency is
the least of the worries.

So the only relevant place is actually NMI, which wants to be tracked to
avoid false positives. But that tracking really can wait until the NMI
has actually reached a halfway stable state.

The other place which has preempt_disable/enable_notrace() in it is
rcu_is_watching(), but it's trivial enough to provide a naked version
for that.
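
A minimal sketch of such a naked variant, assuming it sits next to
rcu_is_watching() in kernel/rcu/tree.c (the name is an assumption, not an
existing function):

	/*
	 * Sketch: for callers which already run with preemption or
	 * interrupts disabled, the preempt_disable/enable_notrace() pair
	 * in rcu_is_watching() is not needed.
	 */
	static bool rcu_is_watching_naked(void)
	{
		return !rcu_dynticks_curr_cpu_in_eqs();
	}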

Thanks,

        tglx

8<------------------
Subject: nmi, tracing: Provide nmi_enter/exit_notrace()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 11 May 2020 10:57:16 +0200

To fully isolate #DB and #BP from instrumentable code it's necessary to
avoid invoking the hardware latency tracer on nmi_enter/exit().

Provide nmi_enter/exit() variants which do not invoke the hardware
latency tracer. That allows the calls to be placed explicitly at the
call sites, outside of the kprobe handling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: New patch
---
 include/linux/hardirq.h |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -77,28 +77,38 @@ extern void irq_exit(void);
 /*
  * nmi_enter() can nest up to 15 times; see NMI_BITS.
  */
-#define nmi_enter()						\
+#define nmi_enter_notrace()					\
 	do {							\
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
 		lockdep_off();					\
-		ftrace_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
 	} while (0)
 
-#define nmi_exit()						\
+#define nmi_enter()						\
+	do {							\
+		nmi_enter_notrace();				\
+		ftrace_nmi_enter();				\
+	} while (0)
+
+#define nmi_exit_notrace()					\
 	do {							\
 		lockdep_hardirq_exit();				\
 		rcu_nmi_exit();					\
 		BUG_ON(!in_nmi());				\
 		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
-		ftrace_nmi_exit();				\
 		lockdep_on();					\
 		printk_nmi_exit();				\
 		arch_nmi_exit();				\
 	} while (0)
 
+#define nmi_exit()						\
+	do {							\
+		ftrace_nmi_exit();				\
+		nmi_exit_notrace();				\
+	} while (0)
+
 #endif /* LINUX_HARDIRQ_H */
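
With these variants a call site decides where (and whether) the latency
tracer hooks run, roughly along these lines (the handler names are purely
illustrative):

	/* #DB/#BP style handler: no hardware latency tracing wanted. */
	void example_debug_handler(void)
	{
		nmi_enter_notrace();
		/* ... handle the exception ... */
		nmi_exit_notrace();
	}

	/* Real NMI: invoke the tracer explicitly once the NMI state is
	 * halfway stable, outside of the kprobe handling. */
	void example_nmi(void)
	{
		nmi_enter_notrace();
		/* ... establish a stable state ... */
		ftrace_nmi_enter();

		/* ... handle the NMI ... */

		ftrace_nmi_exit();
		nmi_exit_notrace();
	}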

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-09  0:10   ` Andy Lutomirski
  2020-05-09 10:25     ` Thomas Gleixner
@ 2020-05-12  1:48     ` Steven Rostedt
  1 sibling, 0 replies; 69+ messages in thread
From: Steven Rostedt @ 2020-05-12  1:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Boris Ostrovsky, Juergen Gross, Brian Gerst,
	Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Fri, 8 May 2020 17:10:09 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > The preempt_enable_notrace() ASM thunk is called from tracing, entry code
> > RCU and other places which are already in or going to be in the noinstr
> > section which protects sensitve code from being instrumented.  
> 
> This text and $SUBJECT agree that you're talking about
> preempt_enable_notrace(), but:
> 
> > +       THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace, check_if=1  
> 
> You actually seem to be changing preempt_schedule_notrace().
> 
> The actual code in question has this comment:
> 
> /**
>  * preempt_schedule_notrace - preempt_schedule called by tracing
>  *
>  * The tracing infrastructure uses preempt_enable_notrace to prevent
>  * recursion and tracing preempt enabling caused by the tracing
>  * infrastructure itself. But as tracing can happen in areas coming
>  * from userspace or just about to enter userspace, a preempt enable
>  * can occur before user_exit() is called. This will cause the scheduler
>  * to be called when the system is still in usermode.
>  *
>  * To prevent this, the preempt_enable_notrace will use this function
>  * instead of preempt_schedule() to exit user context if needed before
>  * calling the scheduler.
>  */
> 
> Which is no longer really applicable to x86 -- in the state that this
> comment nonsensically refers to as "userspace", x86 *always* has IRQs
> off, which means that preempt_enable() will not schedule.
> 
> So I'm guessing that the issue you're solving is that we have
> redundant preempt disable/enable pairs somewhere in the bowels of
> tracing code that is called with IRQs off, and objtool is now
> complaining.  Could the actual code in question be fixed to assert
> that IRQs are off instead of disabling preemption?  If not, can you
> fix the $SUBJECT and changelog and perhaps add a comment to the code
> as to *why* you're checking IF?  Otherwise some intrepid programmer is
> going to notice it down the road, wonder if it's optimizing anything
> useful at all, and get rid of it.

The commit that added that code is this:

  29bb9e5a75684106a37593ad75ec75ff8312731b

And it may not be applicable anymore, especially after Thomas's
patches. I'll go and stare at that some more. A lot has changed since
2013 ;-)

-- Steve

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
  2020-05-07 14:15   ` Alexandre Chartre
  2020-05-09  0:10   ` Andy Lutomirski
@ 2020-05-12  1:51   ` Steven Rostedt
  2020-05-12  8:14     ` Thomas Gleixner
  2 siblings, 1 reply; 69+ messages in thread
From: Steven Rostedt @ 2020-05-12  1:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

[-- Attachment #1: Type: text/plain, Size: 1150 bytes --]

On Tue, 05 May 2020 15:41:22 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> The preempt_enable_notrace() ASM thunk is called from tracing, entry code
> RCU and other places which are already in or going to be in the noinstr
> section which protects sensitve code from being instrumented.
> 
> Calls out of these sections happen with interrupts disabled, which is
> handled in C code, but the push regs, call, pop regs sequence can be
> completely avoided in this case.
> 
> This is also a preparatory step for annotating the call from the thunk to
> preempt_enable_notrace() as safe from a noinstr section.
> 

BTW, after applying this patch, I get the following error:

/work/git/linux-test.git/arch/x86/entry/thunk_64.S: Assembler messages:
/work/git/linux-test.git/arch/x86/entry/thunk_64.S:67: Error: invalid operands (*UND* and *UND* sections) for `+'
/work/git/linux-test.git/arch/x86/entry/thunk_64.S:67: Error: invalid operands (*UND* and *ABS* sections) for `/'
make[3]: *** [/work/git/linux-test.git/scripts/Makefile.build:349: arch/x86/entry/thunk_64.o] Error 1
make[3]: *** Waiting for unfinished jobs....

Config attached.

-- Steve


[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 34653 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk
  2020-05-12  1:51   ` Steven Rostedt
@ 2020-05-12  8:14     ` Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: Thomas Gleixner @ 2020-05-12  8:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Steven Rostedt <rostedt@goodmis.org> writes:
> On Tue, 05 May 2020 15:41:22 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> The preempt_enable_notrace() ASM thunk is called from tracing, entry code
>> RCU and other places which are already in or going to be in the noinstr
>> section which protects sensitve code from being instrumented.
>> 
>> Calls out of these sections happen with interrupts disabled, which is
>> handled in C code, but the push regs, call, pop regs sequence can be
>> completely avoided in this case.
>> 
>> This is also a preparatory step for annotating the call from the thunk to
>> preempt_enable_notrace() as safe from a noinstr section.
>> 
>
> BTW, after applying this patch, I get the following error:
>
> /work/git/linux-test.git/arch/x86/entry/thunk_64.S: Assembler messages:
> /work/git/linux-test.git/arch/x86/entry/thunk_64.S:67: Error: invalid operands (*UND* and *UND* sections) for `+'
> /work/git/linux-test.git/arch/x86/entry/thunk_64.S:67: Error: invalid operands (*UND* and *ABS* sections) for `/'
> make[3]: *** [/work/git/linux-test.git/scripts/Makefile.build:349: arch/x86/entry/thunk_64.o] Error 1
> make[3]: *** Waiting for unfinished jobs....

Yes I know, but I'm going to drop that patch completely.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry: Make entry_64_compat.S objtool clean
  2020-05-05 13:41 ` [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean Thomas Gleixner
  2020-05-09  0:11   ` Andy Lutomirski
@ 2020-05-19 19:58   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, Andy Lutomirski, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     b5f7e5db3839c5e67af6544872f35e2d70359518
Gitweb:        https://git.kernel.org/tip/b5f7e5db3839c5e67af6544872f35e2d70359518
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 12 May 2020 18:17:12 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:52 +02:00

x86/entry: Make entry_64_compat.S objtool clean

Currently entry_64_compat is exempt from objtool, but with vmlinux
mode there is no hiding it.

Make the following changes to make it pass:

 - change entry_SYSENTER_compat to STT_NOTYPE; it's not a function
   and doesn't have function type stack setup.

 - mark all STT_NOTYPE symbols with UNWIND_HINT_EMPTY; so we do
   validate them and don't treat them as unreachable.

 - don't abuse RSP as a temp register, this confuses objtool
   mightily as it (rightfully) thinks we're doing unspeakable
   things to the stack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134341.272248024@linutronix.de


---
 arch/x86/entry/Makefile          |  2 --
 arch/x86/entry/entry_64_compat.S | 25 ++++++++++++++++++++-----
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index cdf45ff..b7a5790 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -11,8 +11,6 @@ CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-
 CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
 CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
 
-OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
-
 CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_32.o		+= $(call cc-option,-Wno-override-init,)
 obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 7c29ed8..0f974ae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -46,12 +46,14 @@
  * ebp  user stack
  * 0(%ebp) arg6
  */
-SYM_FUNC_START(entry_SYSENTER_compat)
+SYM_CODE_START(entry_SYSENTER_compat)
+	UNWIND_HINT_EMPTY
 	/* Interrupts are off on entry. */
 	SWAPGS
 
-	/* We are about to clobber %rsp anyway, clobbering here is OK */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+	pushq	%rax
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+	popq	%rax
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
@@ -104,6 +106,9 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	xorl	%r14d, %r14d		/* nospec   r14 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
+
+	UNWIND_HINT_REGS
+
 	cld
 
 	/*
@@ -141,7 +146,7 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	popfq
 	jmp	.Lsysenter_flags_fixed
 SYM_INNER_LABEL(__end_entry_SYSENTER_compat, SYM_L_GLOBAL)
-SYM_FUNC_END(entry_SYSENTER_compat)
+SYM_CODE_END(entry_SYSENTER_compat)
 
 /*
  * 32-bit SYSCALL entry.
@@ -191,6 +196,7 @@ SYM_FUNC_END(entry_SYSENTER_compat)
  * 0(%esp) arg6
  */
 SYM_CODE_START(entry_SYSCALL_compat)
+	UNWIND_HINT_EMPTY
 	/* Interrupts are off on entry. */
 	swapgs
 
@@ -241,6 +247,8 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
 
+	UNWIND_HINT_REGS
+
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -328,6 +336,7 @@ SYM_CODE_END(entry_SYSCALL_compat)
  * ebp  arg6
  */
 SYM_CODE_START(entry_INT80_compat)
+	UNWIND_HINT_EMPTY
 	/*
 	 * Interrupts are off on entry.
 	 */
@@ -349,8 +358,11 @@ SYM_CODE_START(entry_INT80_compat)
 
 	/* Need to switch before accessing the thread stack. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/* In the Xen PV case we already run on the thread stack. */
-	ALTERNATIVE "movq %rsp, %rdi", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
+	ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
+
+	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	pushq	6*8(%rdi)		/* regs->ss */
@@ -389,6 +401,9 @@ SYM_CODE_START(entry_INT80_compat)
 	xorl	%r14d, %r14d		/* nospec   r14 */
 	pushq   %r15                    /* pt_regs->r15 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
+
+	UNWIND_HINT_REGS
+
 	cld
 
 	movq	%rsp, %rdi

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry: Move irq flags tracing to prepare_exit_to_usermode()
  2020-05-05 13:41 ` [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode() Thomas Gleixner
  2020-05-08 23:57   ` Andy Lutomirski
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra,
	Andy Lutomirski, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     59a42be78098ad1e212abb6eea5d05ed429b5a64
Gitweb:        https://git.kernel.org/tip/59a42be78098ad1e212abb6eea5d05ed429b5a64
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 04 Mar 2020 12:51:59 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:51 +02:00

x86/entry: Move irq flags tracing to prepare_exit_to_usermode()

This is another step towards more C-code and less convoluted ASM.

Similar to the entry path, invoke the tracer before context tracking which
might turn off RCU and invoke lockdep as the last step before going back to
user space. Annotate the code sections in exit_to_user_mode() accordingly
so objtool won't complain about the tracer invocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505134340.703783926@linutronix.de


---
 arch/x86/entry/common.c          | 19 ++++++++++++++++++-
 arch/x86/entry/entry_32.S        | 12 ++++--------
 arch/x86/entry/entry_64.S        |  4 ----
 arch/x86/entry/entry_64_compat.S | 14 +++++---------
 4 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7473c12..e4f9f5f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -72,10 +72,27 @@ static __always_inline void enter_from_user_mode(void)
 }
 #endif
 
-static noinstr void exit_to_user_mode(void)
+/**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall exit enables interrupts, but the kernel state is interrupts
+ * disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Clear CPU buffers if CPU is affected by MDS and the mitigation is on.
+ * 4) Tell lockdep that interrupts are enabled
+ */
+static __always_inline void exit_to_user_mode(void)
 {
+	instrumentation_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instrumentation_end();
+
 	user_enter_irqoff();
 	mds_user_clear_cpu_buffers();
+	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 65704e0..d9da0b7 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -817,8 +817,7 @@ SYM_CODE_START(ret_from_fork)
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
 	call    syscall_return_slowpath
-	STACKLEAK_ERASE
-	jmp     restore_all
+	jmp     .Lsyscall_32_done
 
 	/* kernel thread */
 1:	movl	%edi, %eax
@@ -862,7 +861,7 @@ ret_from_intr:
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
 	call	prepare_exit_to_usermode
-	jmp	restore_all
+	jmp	restore_all_switch_stack
 SYM_CODE_END(ret_from_exception)
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
@@ -975,8 +974,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
 
 	STACKLEAK_ERASE
 
-/* Opportunistic SYSEXIT */
-	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+	/* Opportunistic SYSEXIT */
 
 	/*
 	 * Setup entry stack - we keep the pointer in %eax and do the
@@ -1079,11 +1077,9 @@ SYM_FUNC_START(entry_INT80_32)
 	movl	%esp, %eax
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
-
 	STACKLEAK_ERASE
 
-restore_all:
-	TRACE_IRQS_ON
+restore_all_switch_stack:
 	SWITCH_TO_ENTRY_STACK
 	CHECK_AND_APPLY_ESPFIX
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9e34fe8..9866b54 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -172,8 +172,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	movq	%rsp, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
-	TRACE_IRQS_ON			/* return enables interrupts */
-
 	/*
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.  If we're not,
@@ -342,7 +340,6 @@ SYM_CODE_START(ret_from_fork)
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
-	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 
 1:
@@ -620,7 +617,6 @@ ret_from_intr:
 .Lretint_user:
 	mov	%rsp,%rdi
 	call	prepare_exit_to_usermode
-	TRACE_IRQS_ON
 
 SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 #ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e2e8bd7..7c29ed8 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -132,8 +132,8 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 	jmp	sysret32_from_system_call
 
 .Lsysenter_fix_flags:
@@ -244,8 +244,8 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
 	/* Opportunistic SYSRET */
 sysret32_from_system_call:
@@ -254,7 +254,7 @@ sysret32_from_system_call:
 	 * stack. So let's erase the thread stack right now.
 	 */
 	STACKLEAK_ERASE
-	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+
 	movq	RBX(%rsp), %rbx		/* pt_regs->rbx */
 	movq	RBP(%rsp), %rbp		/* pt_regs->rbp */
 	movq	EFLAGS(%rsp), %r11	/* pt_regs->flags (in r11) */
@@ -393,9 +393,5 @@ SYM_CODE_START(entry_INT80_compat)
 
 	movq	%rsp, %rdi
 	call	do_int80_syscall_32
-.Lsyscall_32_done:
-
-	/* Go back to user mode. */
-	TRACE_IRQS_ON
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline
  2020-05-05 13:41 ` [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section Thomas Gleixner
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] lib/smp_processor_id: Move it into noinstr section tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     c48dd99ee6c2fe0a5a3e1667eca0cceb57797d21
Gitweb:        https://git.kernel.org/tip/c48dd99ee6c2fe0a5a3e1667eca0cceb57797d21
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 04 Mar 2020 12:49:18 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:51 +02:00

x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline

Prevent the compiler from uninlining and creating traceable/probeable
functions as this is invoked _after_ context tracking switched to
CONTEXT_USER and RCU idle.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.902709267@linutronix.de


---
 arch/x86/include/asm/nospec-branch.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index d52d1aa..e7752b4 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -262,7 +262,7 @@ DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
  * combination with microcode which triggers a CPU buffer flush when the
  * instruction is executed.
  */
-static inline void mds_clear_cpu_buffers(void)
+static __always_inline void mds_clear_cpu_buffers(void)
 {
 	static const u16 ds = __KERNEL_DS;
 
@@ -283,7 +283,7 @@ static inline void mds_clear_cpu_buffers(void)
  *
  * Clear CPU buffers if the corresponding static key is enabled
  */
-static inline void mds_user_clear_cpu_buffers(void)
+static __always_inline void mds_user_clear_cpu_buffers(void)
 {
 	if (static_branch_likely(&mds_user_clear))
 		mds_clear_cpu_buffers();

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry/common: Protect against instrumentation
  2020-05-05 13:41 ` [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation Thomas Gleixner
  2020-05-07 13:39   ` Alexandre Chartre
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     aa9712e07f82a5458f2f16c100c491d736240d60
Gitweb:        https://git.kernel.org/tip/aa9712e07f82a5458f2f16c100c491d736240d60
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 10 Mar 2020 14:46:27 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:50 +02:00

x86/entry/common: Protect against instrumentation

Mark the various syscall entries with noinstr to protect them against
instrumentation and add the instrumentation_begin()/end() annotations to
mark the parts of the functions which are safe to call out into
instrumentable code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.520277507@linutronix.de


---
 arch/x86/entry/common.c | 133 ++++++++++++++++++++++++++-------------
 1 file changed, 89 insertions(+), 44 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d862add..9892fb7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,15 +41,26 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
-__visible inline noinstr void enter_from_user_mode(void)
+__visible noinstr void enter_from_user_mode(void)
 {
-	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	enum ctx_state state = ct_state();
+
 	user_exit_irqoff();
+
+	instrumentation_begin();
+	CT_WARN_ON(state != CONTEXT_USER);
+	instrumentation_end();
 }
 #else
 static inline void enter_from_user_mode(void) {}
 #endif
 
+static noinstr void exit_to_user_mode(void)
+{
+	user_enter_irqoff();
+	mds_user_clear_cpu_buffers();
+}
+
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
 {
 #ifdef CONFIG_X86_64
@@ -179,8 +190,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 	}
 }
 
-/* Called with IRQs disabled. */
-__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
+static void __prepare_exit_to_usermode(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
@@ -219,10 +229,14 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 */
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
+}
 
-	user_enter_irqoff();
-
-	mds_user_clear_cpu_buffers();
+__visible noinstr void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	__prepare_exit_to_usermode(regs);
+	instrumentation_end();
+	exit_to_user_mode();
 }
 
 #define SYSCALL_EXIT_WORK_FLAGS				\
@@ -251,11 +265,7 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
 		tracehook_report_syscall_exit(regs, step);
 }
 
-/*
- * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
- * state such that we can immediately switch to user mode.
- */
-__visible inline void syscall_return_slowpath(struct pt_regs *regs)
+static void __syscall_return_slowpath(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
@@ -276,15 +286,29 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 		syscall_slow_exit_work(regs, cached_flags);
 
 	local_irq_disable();
-	prepare_exit_to_usermode(regs);
+	__prepare_exit_to_usermode(regs);
+}
+
+/*
+ * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+__visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	__syscall_return_slowpath(regs);
+	instrumentation_end();
+	exit_to_user_mode();
 }
 
 #ifdef CONFIG_X86_64
-__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
 	struct thread_info *ti;
 
 	enter_from_user_mode();
+	instrumentation_begin();
+
 	local_irq_enable();
 	ti = current_thread_info();
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
@@ -301,8 +325,10 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
+	__syscall_return_slowpath(regs);
 
-	syscall_return_slowpath(regs);
+	instrumentation_end();
+	exit_to_user_mode();
 }
 #endif
 
@@ -313,7 +339,7 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
  * extremely hot in workloads that use it, and it's usually called from
  * do_fast_syscall_32, so forcibly inline it to improve performance.
  */
-static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
+static void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
 	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
@@ -337,27 +363,62 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 		regs->ax = ia32_sys_call_table[nr](regs);
 	}
 
-	syscall_return_slowpath(regs);
+	__syscall_return_slowpath(regs);
 }
 
 /* Handles int $0x80 */
-__visible void do_int80_syscall_32(struct pt_regs *regs)
+__visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	enter_from_user_mode();
+	instrumentation_begin();
+
 	local_irq_enable();
 	do_syscall_32_irqs_on(regs);
+
+	instrumentation_end();
+	exit_to_user_mode();
+}
+
+static bool __do_fast_syscall_32(struct pt_regs *regs)
+{
+	int res;
+
+	/* Fetch EBP from where the vDSO stashed it. */
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		/*
+		 * Micro-optimization: the pointer we're following is
+		 * explicitly 32 bits, so it can't be out of range.
+		 */
+		res = __get_user(*(u32 *)&regs->bp,
+			 (u32 __user __force *)(unsigned long)(u32)regs->sp);
+	} else {
+		res = get_user(*(u32 *)&regs->bp,
+		       (u32 __user __force *)(unsigned long)(u32)regs->sp);
+	}
+
+	if (res) {
+		/* User code screwed up. */
+		regs->ax = -EFAULT;
+		local_irq_disable();
+		__prepare_exit_to_usermode(regs);
+		return false;
+	}
+
+	/* Now this is just like a normal syscall. */
+	do_syscall_32_irqs_on(regs);
+	return true;
 }
 
 /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible long do_fast_syscall_32(struct pt_regs *regs)
+__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 {
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
 	 * convention.  Adjust regs so it looks like we entered using int80.
 	 */
-
 	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
-		vdso_image_32.sym_int80_landing_pad;
+					vdso_image_32.sym_int80_landing_pad;
+	bool success;
 
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
@@ -367,33 +428,17 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
 	regs->ip = landing_pad;
 
 	enter_from_user_mode();
+	instrumentation_begin();
 
 	local_irq_enable();
+	success = __do_fast_syscall_32(regs);
 
-	/* Fetch EBP from where the vDSO stashed it. */
-	if (
-#ifdef CONFIG_X86_64
-		/*
-		 * Micro-optimization: the pointer we're following is explicitly
-		 * 32 bits, so it can't be out of range.
-		 */
-		__get_user(*(u32 *)&regs->bp,
-			    (u32 __user __force *)(unsigned long)(u32)regs->sp)
-#else
-		get_user(*(u32 *)&regs->bp,
-			 (u32 __user __force *)(unsigned long)(u32)regs->sp)
-#endif
-		) {
-
-		/* User code screwed up. */
-		local_irq_disable();
-		regs->ax = -EFAULT;
-		prepare_exit_to_usermode(regs);
-		return 0;	/* Keep it simple: use IRET. */
-	}
+	instrumentation_end();
+	exit_to_user_mode();
 
-	/* Now this is just like a normal syscall. */
-	do_syscall_32_irqs_on(regs);
+	/* If it failed, keep it simple: use IRET. */
+	if (!success)
+		return 0;
 
 #ifdef CONFIG_X86_64
 	/*

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry: Move irq tracing on syscall entry to C-code
  2020-05-05 13:41 ` [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code Thomas Gleixner
  2020-05-07 13:55   ` Alexandre Chartre
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     f0fd87b82db7b0102ba98991fa36c2318d2e2894
Gitweb:        https://git.kernel.org/tip/f0fd87b82db7b0102ba98991fa36c2318d2e2894
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 23:08:05 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:51 +02:00

x86/entry: Move irq tracing on syscall entry to C-code

Now that the C entry points are safe, move the irq flags tracing code into
the entry helper:

    - Invoke lockdep before calling into context tracking

    - Use the safe trace_hardirqs_on_prepare() trace function after context
      tracking established state and RCU is watching.

enter_from_user_mode() is also still invoked from the exception/interrupt
entry code which still contains the ASM irq flags tracing. So this is just
a redundant and harmless invocation of tracing / lockdep until these are
removed as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.611961721@linutronix.de


---
 arch/x86/entry/common.c          | 21 +++++++++++++++++++--
 arch/x86/entry/entry_32.S        | 12 ------------
 arch/x86/entry/entry_64.S        |  2 --
 arch/x86/entry/entry_64_compat.S | 18 ------------------
 4 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9892fb7..7473c12 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -40,19 +40,36 @@
 #include <trace/events/syscalls.h>
 
 #ifdef CONFIG_CONTEXT_TRACKING
-/* Called on entry from user mode with IRQs off. */
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall entry disables interrupts, but user mode is traced as interrupts
+ * enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
 __visible noinstr void enter_from_user_mode(void)
 {
 	enum ctx_state state = ct_state();
 
+	lockdep_hardirqs_off(CALLER_ADDR0);
 	user_exit_irqoff();
 
 	instrumentation_begin();
 	CT_WARN_ON(state != CONTEXT_USER);
+	trace_hardirqs_off_prepare();
 	instrumentation_end();
 }
 #else
-static inline void enter_from_user_mode(void) {}
+static __always_inline void enter_from_user_mode(void)
+{
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	instrumentation_begin();
+	trace_hardirqs_off_prepare();
+	instrumentation_end();
+}
 #endif
 
 static noinstr void exit_to_user_mode(void)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index bf0082b..65704e0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -967,12 +967,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movl	%esp, %eax
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -1082,12 +1076,6 @@ SYM_FUNC_START(entry_INT80_32)
 
 	SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1	/* save rest */
 
-	/*
-	 * User mode is traced as though IRQs are on, and the interrupt gate
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movl	%esp, %eax
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b199f43..9e34fe8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,8 +167,6 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
-	TRACE_IRQS_OFF
-
 	/* IRQs are off. */
 	movq	%rax, %rdi
 	movq	%rsp, %rsi
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f1d3cca..e2e8bd7 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -129,12 +129,6 @@ SYM_FUNC_START(entry_SYSENTER_compat)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -247,12 +241,6 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	pushq   $0			/* pt_regs->r15 = 0 */
 	xorl	%r15d, %r15d		/* nospec   r15 */
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
-	 * turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -403,12 +391,6 @@ SYM_CODE_START(entry_INT80_compat)
 	xorl	%r15d, %r15d		/* nospec   r15 */
 	cld
 
-	/*
-	 * User mode is traced as though IRQs are on, and the interrupt
-	 * gate turned them off.
-	 */
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi
 	call	do_int80_syscall_32
 .Lsyscall_32_done:

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry/32: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 02/18] x86/entry/32: " Thomas Gleixner
  2020-05-07 13:15   ` Alexandre Chartre
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     cd809a7a917164820bdb20a7d41f3d1ce98ddc83
Gitweb:        https://git.kernel.org/tip/cd809a7a917164820bdb20a7d41f3d1ce98ddc83
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 25 Mar 2020 19:47:40 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:49 +02:00

x86/entry/32: Move non entry code into .text section

All ASM code which is not part of the entry functionality can move out into
the .text section. No reason to keep it in the non-instrumentable entry
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.320164650@linutronix.de


---
 arch/x86/entry/entry_32.S |  9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a5eed84..bf0082b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -730,6 +730,7 @@
  * %eax: prev task
  * %edx: next task
  */
+.pushsection .text, "ax"
 SYM_CODE_START(__switch_to_asm)
 	/*
 	 * Save callee-saved registers
@@ -776,6 +777,7 @@ SYM_CODE_START(__switch_to_asm)
 
 	jmp	__switch_to
 SYM_CODE_END(__switch_to_asm)
+.popsection
 
 /*
  * The unwinder expects the last frame on the stack to always be at the same
@@ -784,6 +786,7 @@ SYM_CODE_END(__switch_to_asm)
  * asmlinkage function so its argument has to be pushed on the stack.  This
  * wrapper creates a proper "end of stack" frame header before the call.
  */
+.pushsection .text, "ax"
 SYM_FUNC_START(schedule_tail_wrapper)
 	FRAME_BEGIN
 
@@ -794,6 +797,8 @@ SYM_FUNC_START(schedule_tail_wrapper)
 	FRAME_END
 	ret
 SYM_FUNC_END(schedule_tail_wrapper)
+.popsection
+
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -801,6 +806,7 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * ebx: kernel thread func (NULL for user thread)
  * edi: kernel thread arg
  */
+.pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
 	call	schedule_tail_wrapper
 
@@ -825,6 +831,7 @@ SYM_CODE_START(ret_from_fork)
 	movl	$0, PT_EAX(%esp)
 	jmp	2b
 SYM_CODE_END(ret_from_fork)
+.popsection
 
 /*
  * Return to user mode is not as complex as all this looks,
@@ -1691,6 +1698,7 @@ SYM_CODE_START(general_protection)
 	jmp	common_exception
 SYM_CODE_END(general_protection)
 
+.pushsection .text, "ax"
 SYM_CODE_START(rewind_stack_do_exit)
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
@@ -1701,3 +1709,4 @@ SYM_CODE_START(rewind_stack_do_exit)
 	call	do_exit
 1:	jmp 1b
 SYM_CODE_END(rewind_stack_do_exit)
+.popsection

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry: Mark enter_from_user_mode() noinstr
  2020-05-05 13:41 ` [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr Thomas Gleixner
  2020-05-08  8:21   ` Masami Hiramatsu
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Masami Hiramatsu, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     8bd73999307ba1bba5152efd327b020ad38f8c13
Gitweb:        https://git.kernel.org/tip/8bd73999307ba1bba5152efd327b020ad38f8c13
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 29 Feb 2020 15:12:33 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:50 +02:00

x86/entry: Mark enter_from_user_mode() noinstr

Both the callers in the low level ASM code and __context_tracking_exit()
which is invoked from enter_from_user_mode() via user_exit_irqoff() are
marked NOKPROBE. Allowing enter_from_user_mode() to be probed is
inconsistent at best.

Aside from that, while function tracing per se is safe, the function
trace entry/exit points can be used via BPF as well, which is not safe
to use before context tracking has reached CONTEXT_KERNEL and adjusted
RCU.

Mark it noinstr, which moves it into the instrumentation-protected text
section and includes notrace.

Note, this needs further fixups in context tracking to ensure that the
full call chain is protected. Will be addressed in follow up changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.429059405@linutronix.de


---
 arch/x86/entry/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 76735ec..d862add 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -41,7 +41,7 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
-__visible inline void enter_from_user_mode(void)
+__visible inline noinstr void enter_from_user_mode(void)
 {
 	CT_WARN_ON(ct_state() != CONTEXT_USER);
 	user_exit_irqoff();
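
To illustrate the resulting pattern outside the kernel tree: noinstr roughly
combines placement in the protected text section with notrace, and everything
such a function calls before context tracking has switched to CONTEXT_KERNEL
must be protected the same way. A minimal standalone C sketch with a
simplified, hypothetical stand-in for the real annotation (the my_* names are
not the kernel API):

/* Hypothetical stand-in for the kernel's noinstr annotation: keep the
 * function out of the normal .text section and out of function tracing. */
#define my_noinstr \
	__attribute__((__section__(".noinstr.text"), __no_instrument_function__))

/* Must be equally protected: it runs before RCU/context tracking have
 * switched back to kernel mode. */
static my_noinstr void my_user_exit_irqoff(void)
{
	/* would call __context_tracking_exit(CONTEXT_USER) here */
}

/* Entry from user mode with interrupts off: nothing instrumentable may
 * run until the context tracking switch above has happened. */
my_noinstr void my_enter_from_user_mode(void)
{
	my_user_exit_irqoff();
}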

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] x86/entry/64: Move non entry code into .text section
  2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-08 23:53   ` Andy Lutomirski
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  3 siblings, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Steven Rostedt (VMware),
	Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     be06832a9a628bca72b0c6ceb447e5f5f529cd30
Gitweb:        https://git.kernel.org/tip/be06832a9a628bca72b0c6ceb447e5f5f529cd30
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 25 Mar 2020 19:45:26 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 16:03:49 +02:00

x86/entry/64: Move non entry code into .text section

All ASM code which is not part of the entry functionality can move out into
the .text section. No reason to keep it in the non-instrumentable entry
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.227579223@linutronix.de


---
 arch/x86/entry/entry_64.S   | 18 ++++++++++++++----
 arch/x86/kernel/ftrace_64.S |  2 +-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a15b70a..b199f43 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -279,6 +279,7 @@ SYM_CODE_END(entry_SYSCALL_64)
  * %rdi: prev task
  * %rsi: next task
  */
+.pushsection .text, "ax"
 SYM_FUNC_START(__switch_to_asm)
 	/*
 	 * Save callee-saved registers
@@ -321,6 +322,7 @@ SYM_FUNC_START(__switch_to_asm)
 
 	jmp	__switch_to
 SYM_FUNC_END(__switch_to_asm)
+.popsection
 
 /*
  * A newly forked process directly context switches into this address.
@@ -329,6 +331,7 @@ SYM_FUNC_END(__switch_to_asm)
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
  */
+.pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
 	UNWIND_HINT_EMPTY
 	movq	%rax, %rdi
@@ -357,6 +360,7 @@ SYM_CODE_START(ret_from_fork)
 	movq	$0, RAX(%rsp)
 	jmp	2b
 SYM_CODE_END(ret_from_fork)
+.popsection
 
 /*
  * Build the entry stubs with some assembler magic.
@@ -1037,10 +1041,12 @@ idtentry alignment_check		do_alignment_check		has_error_code=1
 idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
 
 
-	/*
-	 * Reload gs selector with exception handling
-	 * edi:  new selector
-	 */
+/*
+ * Reload gs selector with exception handling
+ * edi:  new selector
+ *
+ * Is in entry.text as it shouldn't be instrumented.
+ */
 SYM_FUNC_START(native_load_gs_index)
 	FRAME_BEGIN
 	pushfq
@@ -1076,6 +1082,7 @@ SYM_CODE_END(.Lbad_gs)
 	.previous
 
 /* Call softirq on interrupt stack. Interrupts are off. */
+.pushsection .text, "ax"
 SYM_FUNC_START(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
@@ -1085,6 +1092,7 @@ SYM_FUNC_START(do_softirq_own_stack)
 	leaveq
 	ret
 SYM_FUNC_END(do_softirq_own_stack)
+.popsection
 
 #ifdef CONFIG_XEN_PV
 idtentry hypervisor_callback xen_do_hypervisor_callback has_error_code=0
@@ -1728,6 +1736,7 @@ SYM_CODE_START(ignore_sysret)
 SYM_CODE_END(ignore_sysret)
 #endif
 
+.pushsection .text, "ax"
 SYM_CODE_START(rewind_stack_do_exit)
 	UNWIND_HINT_FUNC
 	/* Prevent any naive code from trying to unwind to our caller. */
@@ -1739,3 +1748,4 @@ SYM_CODE_START(rewind_stack_do_exit)
 
 	call	do_exit
 SYM_CODE_END(rewind_stack_do_exit)
+.popsection
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index aa5d28a..083a3da 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -12,7 +12,7 @@
 #include <asm/frame.h>
 
 	.code64
-	.section .entry.text, "ax"
+	.section .text, "ax"
 
 #ifdef CONFIG_FRAME_POINTER
 /* Save parent and function stack frames (rip and rbp) */

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] lib/smp_processor_id: Move it into noinstr section
  2020-05-05 13:41 ` [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline tip-bot2 for Thomas Gleixner
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     85e04f67df29915ef76c2db0fe1b7a8b44988c41
Gitweb:        https://git.kernel.org/tip/85e04f67df29915ef76c2db0fe1b7a8b44988c41
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 10 Mar 2020 23:47:39 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:47:22 +02:00

lib/smp_processor_id: Move it into noinstr section

That code is already not traceable. Move it into the noinstr section so the
objtool section validation does not trigger.

Annotate the warning code as "safe". While it might not be safe under all
circumstances, getting the information out is important enough.

Should this ever trigger from the sensitive code which is shielded against
instrumentation, e.g. low level entry, then the printk is the least of the
worries.

Addresses the objtool warnings:
 vmlinux.o: warning: objtool: context_tracking_recursion_enter()+0x7: call to __this_cpu_preempt_check() leaves .noinstr.text section
 vmlinux.o: warning: objtool: __context_tracking_exit()+0x17: call to __this_cpu_preempt_check() leaves .noinstr.text section
 vmlinux.o: warning: objtool: __context_tracking_enter()+0x2a: call to __this_cpu_preempt_check() leaves .noinstr.text section

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.902709267@linutronix.de


---
 lib/smp_processor_id.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c
index bd95716..525222e 100644
--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -8,7 +8,7 @@
 #include <linux/kprobes.h>
 #include <linux/sched.h>
 
-notrace static nokprobe_inline
+noinstr static
 unsigned int check_preemption_disabled(const char *what1, const char *what2)
 {
 	int this_cpu = raw_smp_processor_id();
@@ -37,6 +37,7 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
 	 */
 	preempt_disable_notrace();
 
+	instrumentation_begin();
 	if (!printk_ratelimit())
 		goto out_enable;
 
@@ -45,6 +46,7 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
 
 	printk("caller is %pS\n", __builtin_return_address(0));
 	dump_stack();
+	instrumentation_end();
 
 out_enable:
 	preempt_enable_no_resched_notrace();
@@ -52,16 +54,14 @@ out:
 	return this_cpu;
 }
 
-notrace unsigned int debug_smp_processor_id(void)
+noinstr unsigned int debug_smp_processor_id(void)
 {
 	return check_preemption_disabled("smp_processor_id", "");
 }
 EXPORT_SYMBOL(debug_smp_processor_id);
-NOKPROBE_SYMBOL(debug_smp_processor_id);
 
-notrace void __this_cpu_preempt_check(const char *op)
+noinstr void __this_cpu_preempt_check(const char *op)
 {
 	check_preemption_disabled("__this_cpu_", op);
 }
 EXPORT_SYMBOL(__this_cpu_preempt_check);
-NOKPROBE_SYMBOL(__this_cpu_preempt_check);
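
The pattern this establishes: the whole function lives in the protected
section, and only the explicitly bracketed slow path is allowed to call into
normally instrumentable code such as printk() and dump_stack(). A standalone
sketch of that shape, with hypothetical my_* stand-ins for the kernel
annotations (which, as far as this series is concerned, serve as markers for
the objtool validation):

#include <stdio.h>

/* Hypothetical no-op stand-ins for instrumentation_begin()/_end(): in the
 * kernel they mark a region where calls to instrumentable code are
 * acceptable inside an otherwise protected (noinstr) function. */
#define my_instrumentation_begin()	do { } while (0)
#define my_instrumentation_end()	do { } while (0)

/* Would carry the noinstr annotation in the kernel. */
unsigned int my_check_preemption_disabled(const char *what, int preempt_count)
{
	unsigned int this_cpu = 0;	/* stand-in for raw_smp_processor_id() */

	if (preempt_count)		/* preemption disabled: nothing to report */
		return this_cpu;

	/* Slow path: the diagnostic calls into instrumentable code, so it is
	 * bracketed explicitly instead of poisoning the whole function. */
	my_instrumentation_begin();
	printf("BUG: using %s in preemptible code\n", what);
	my_instrumentation_end();

	return this_cpu;
}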

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] context_tracking: Ensure that the critical path cannot be instrumented
  2020-05-05 13:41 ` [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented Thomas Gleixner
  2020-05-08  8:23   ` Masami Hiramatsu
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Masami Hiramatsu, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     fcb10ef4544407ecdb536f9563fe63afcad44209
Gitweb:        https://git.kernel.org/tip/fcb10ef4544407ecdb536f9563fe63afcad44209
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 04 Mar 2020 11:05:22 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:47:22 +02:00

context_tracking: Ensure that the critical path cannot be instrumented

Context tracking lacks a few protection mechanisms against instrumentation:

 - While the core functions are marked NOKPROBE, they lack protection
   against function tracing, which is required as the function entry/exit
   points can be utilized by BPF.

 - Static functions invoked from the protected functions need to be marked
   as well, as they can otherwise be instrumented.

 - Using plain inline allows the compiler to emit traceable and probeable
   out-of-line functions.

Fix this by marking the functions noinstr and converting the plain inlines
to __always_inline.

The NOKPROBE_SYMBOL() annotations are removed as the .noinstr.text section
is already excluded from being probed.

Cures the following objtool warnings:

 vmlinux.o: warning: objtool: enter_from_user_mode()+0x34: call to __context_tracking_exit() leaves .noinstr.text section
 vmlinux.o: warning: objtool: prepare_exit_to_usermode()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: syscall_return_slowpath()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_syscall_64()+0x7f: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_int80_syscall_32()+0x3d: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_fast_syscall_32()+0x9c: call to __context_tracking_enter() leaves .noinstr.text section

and generates new ones...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.811520478@linutronix.de


---
 include/linux/context_tracking.h       |  6 +++---
 include/linux/context_tracking_state.h |  6 +++---
 kernel/context_tracking.c              | 14 ++++++++------
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 8cac62e..981b880 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -33,13 +33,13 @@ static inline void user_exit(void)
 }
 
 /* Called with interrupts disabled.  */
-static inline void user_enter_irqoff(void)
+static __always_inline void user_enter_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_USER);
 
 }
-static inline void user_exit_irqoff(void)
+static __always_inline void user_exit_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_USER);
@@ -75,7 +75,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
  * is enabled.  If context tracking is disabled, returns
  * CONTEXT_DISABLED.  This should be used primarily for debugging.
  */
-static inline enum ctx_state ct_state(void)
+static __always_inline enum ctx_state ct_state(void)
 {
 	return context_tracking_enabled() ?
 		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index e7fe667..65a60d3 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -26,12 +26,12 @@ struct context_tracking {
 extern struct static_key_false context_tracking_key;
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 
-static inline bool context_tracking_enabled(void)
+static __always_inline bool context_tracking_enabled(void)
 {
 	return static_branch_unlikely(&context_tracking_key);
 }
 
-static inline bool context_tracking_enabled_cpu(int cpu)
+static __always_inline bool context_tracking_enabled_cpu(int cpu)
 {
 	return context_tracking_enabled() && per_cpu(context_tracking.active, cpu);
 }
@@ -41,7 +41,7 @@ static inline bool context_tracking_enabled_this_cpu(void)
 	return context_tracking_enabled() && __this_cpu_read(context_tracking.active);
 }
 
-static inline bool context_tracking_in_user(void)
+static __always_inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index ce43088..36a98c4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -31,7 +31,7 @@ EXPORT_SYMBOL_GPL(context_tracking_key);
 DEFINE_PER_CPU(struct context_tracking, context_tracking);
 EXPORT_SYMBOL_GPL(context_tracking);
 
-static bool context_tracking_recursion_enter(void)
+static noinstr bool context_tracking_recursion_enter(void)
 {
 	int recursion;
 
@@ -45,7 +45,7 @@ static bool context_tracking_recursion_enter(void)
 	return false;
 }
 
-static void context_tracking_recursion_exit(void)
+static __always_inline void context_tracking_recursion_exit(void)
 {
 	__this_cpu_dec(context_tracking.recursion);
 }
@@ -59,7 +59,7 @@ static void context_tracking_recursion_exit(void)
  * instructions to execute won't use any RCU read side critical section
  * because this function sets RCU in extended quiescent state.
  */
-void __context_tracking_enter(enum ctx_state state)
+void noinstr __context_tracking_enter(enum ctx_state state)
 {
 	/* Kernel threads aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);
@@ -77,8 +77,10 @@ void __context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				instrumentation_begin();
 				trace_user_enter(0);
 				vtime_user_enter(current);
+				instrumentation_end();
 			}
 			rcu_user_enter();
 		}
@@ -99,7 +101,6 @@ void __context_tracking_enter(enum ctx_state state)
 	}
 	context_tracking_recursion_exit();
 }
-NOKPROBE_SYMBOL(__context_tracking_enter);
 EXPORT_SYMBOL_GPL(__context_tracking_enter);
 
 void context_tracking_enter(enum ctx_state state)
@@ -142,7 +143,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void __context_tracking_exit(enum ctx_state state)
+void noinstr __context_tracking_exit(enum ctx_state state)
 {
 	if (!context_tracking_recursion_enter())
 		return;
@@ -155,15 +156,16 @@ void __context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				instrumentation_begin();
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				instrumentation_end();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
 	}
 	context_tracking_recursion_exit();
 }
-NOKPROBE_SYMBOL(__context_tracking_exit);
 EXPORT_SYMBOL_GPL(__context_tracking_exit);
 
 void context_tracking_exit(enum ctx_state state)
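
The resulting split, sketched outside the kernel tree: the out-of-line core
becomes noinstr (which also makes NOKPROBE_SYMBOL() redundant, since the
.noinstr.text section is already excluded from probing), and the thin
wrappers become __always_inline so the compiler can no longer emit a
traceable out-of-line copy of them. A simplified standalone sketch with
hypothetical my_* names:

#include <stdbool.h>

#define my_always_inline	inline __attribute__((__always_inline__))
#define my_noinstr \
	__attribute__((__section__(".noinstr.text"), __no_instrument_function__))

bool my_tracking_enabled;	/* stand-in for the static key */

/* Core function: out of line, protected against tracing and probing. */
my_noinstr void my_context_tracking_exit(int state)
{
	(void)state;
	/* switch this CPU to CONTEXT_KERNEL, tell RCU, ... */
}

/* Wrapper used by the entry code: must never become a separate,
 * traceable function of its own. */
static my_always_inline void my_user_exit_irqoff(void)
{
	if (my_tracking_enabled)
		my_context_tracking_exit(1 /* CONTEXT_USER */);
}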

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [tip: x86/entry] context_tracking: Make guest_enter/exit() .noinstr ready
  2020-05-05 13:41 ` [patch V4 part 2 16/18] context_tracking: Make guest_enter/exit() .noinstr ready Thomas Gleixner
@ 2020-05-19 19:58   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 69+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     af1e56b78534c38bb0e0c712ca70e59f816b74e9
Gitweb:        https://git.kernel.org/tip/af1e56b78534c38bb0e0c712ca70e59f816b74e9
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 19 Mar 2020 14:53:56 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:47:21 +02:00

context_tracking: Make guest_enter/exit() .noinstr ready

Force inlining of the helpers and mark the instrumentable parts
accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134341.672545766@linutronix.de


---
 include/linux/context_tracking.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 8150f5a..8cac62e 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -101,12 +101,14 @@ static inline void context_tracking_init(void) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 /* must be called with irqs disabled */
-static inline void guest_enter_irqoff(void)
+static __always_inline void guest_enter_irqoff(void)
 {
+	instrumentation_begin();
 	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_enter(current);
 	else
 		current->flags |= PF_VCPU;
+	instrumentation_end();
 
 	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_GUEST);
@@ -118,39 +120,48 @@ static inline void guest_enter_irqoff(void)
 	 * one time slice). Lets treat guest mode as quiescent state, just like
 	 * we do with user-mode execution.
 	 */
-	if (!context_tracking_enabled_this_cpu())
+	if (!context_tracking_enabled_this_cpu()) {
+		instrumentation_begin();
 		rcu_virt_note_context_switch(smp_processor_id());
+		instrumentation_end();
+	}
 }
 
-static inline void guest_exit_irqoff(void)
+static __always_inline void guest_exit_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_GUEST);
 
+	instrumentation_begin();
 	if (vtime_accounting_enabled_this_cpu())
 		vtime_guest_exit(current);
 	else
 		current->flags &= ~PF_VCPU;
+	instrumentation_end();
 }
 
 #else
-static inline void guest_enter_irqoff(void)
+static __always_inline void guest_enter_irqoff(void)
 {
 	/*
 	 * This is running in ioctl context so its safe
 	 * to assume that it's the stime pending cputime
 	 * to flush.
 	 */
+	instrumentation_begin();
 	vtime_account_kernel(current);
 	current->flags |= PF_VCPU;
 	rcu_virt_note_context_switch(smp_processor_id());
+	instrumentation_end();
 }
 
-static inline void guest_exit_irqoff(void)
+static __always_inline void guest_exit_irqoff(void)
 {
+	instrumentation_begin();
 	/* Flush the guest cputime we spent on the guest */
 	vtime_account_kernel(current);
 	current->flags &= ~PF_VCPU;
+	instrumentation_end();
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
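
The ordering is the point: on the way in, the instrumentable vtime accounting
runs before the transition to CONTEXT_GUEST; on the way out, it runs after
the transition back, so nothing instrumentable executes while RCU regards the
CPU as being in an extended quiescent state. A simplified standalone sketch
of that shape (hypothetical my_* helpers modelled loosely on the context
tracking case, not the real API):

/* Hypothetical no-op stand-ins for the kernel annotations. */
#define my_instrumentation_begin()	do { } while (0)
#define my_instrumentation_end()	do { } while (0)

static void my_enter_guest_context(void) { /* RCU: extended quiescent state */ }
static void my_exit_guest_context(void)  { /* RCU is watching again */ }
static void my_vtime_account_guest(void) { /* may trace, take locks, ... */ }

void my_guest_enter_irqoff(void)
{
	/* Still in kernel context: instrumentation is fine here. */
	my_instrumentation_begin();
	my_vtime_account_guest();
	my_instrumentation_end();

	/* From here until guest exit: no instrumentable code. */
	my_enter_guest_context();
}

void my_guest_exit_irqoff(void)
{
	my_exit_guest_context();

	my_instrumentation_begin();
	my_vtime_account_guest();
	my_instrumentation_end();
}

In KVM these brackets end up around the low level VM enter/exit in the vcpu
run loop, which is what the later patches in this series move into
.noinstr.text.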
 

^ permalink raw reply related	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2020-05-19 20:01 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05 13:41 [patch V4 part 2 00/18] x86/entry: Entry/exception code rework, syscall and KVM changes Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 01/18] x86/entry/64: Move non entry code into .text section Thomas Gleixner
2020-05-06 15:51   ` Peter Zijlstra
2020-05-08  1:31   ` Steven Rostedt
2020-05-08 23:53   ` Andy Lutomirski
2020-05-10 13:39     ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 02/18] x86/entry/32: " Thomas Gleixner
2020-05-07 13:15   ` Alexandre Chartre
2020-05-07 14:14     ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 03/18] x86/entry: Mark enter_from_user_mode() noinstr Thomas Gleixner
2020-05-08  8:21   ` Masami Hiramatsu
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 04/18] x86/entry/common: Protect against instrumentation Thomas Gleixner
2020-05-07 13:39   ` Alexandre Chartre
2020-05-07 14:13     ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 05/18] x86/entry: Move irq tracing on syscall entry to C-code Thomas Gleixner
2020-05-07 13:55   ` Alexandre Chartre
2020-05-07 14:10     ` Thomas Gleixner
2020-05-07 15:03       ` Thomas Gleixner
2020-05-07 17:06         ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 06/18] x86/entry: Move irq flags tracing to prepare_exit_to_usermode() Thomas Gleixner
2020-05-08 23:57   ` Andy Lutomirski
2020-05-09 10:16     ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 07/18] context_tracking: Ensure that the critical path cannot be instrumented Thomas Gleixner
2020-05-08  8:23   ` Masami Hiramatsu
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 08/18] lib/smp_processor_id: Move it into noinstr section Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline tip-bot2 for Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] lib/smp_processor_id: Move it into noinstr section tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 09/18] x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 10/18] x86/entry/64: Check IF in __preempt_enable_notrace() thunk Thomas Gleixner
2020-05-07 14:15   ` Alexandre Chartre
2020-05-09  0:10   ` Andy Lutomirski
2020-05-09 10:25     ` Thomas Gleixner
2020-05-10 18:47       ` Thomas Gleixner
2020-05-11 18:27         ` Thomas Gleixner
2020-05-12  1:48     ` Steven Rostedt
2020-05-12  1:51   ` Steven Rostedt
2020-05-12  8:14     ` Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 11/18] x86/entry/64: Mark ___preempt_schedule_notrace() thunk noinstr Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 12/18] x86,objtool: Make entry_64_compat.S objtool clean Thomas Gleixner
2020-05-09  0:11   ` Andy Lutomirski
2020-05-09 10:06     ` Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] x86/entry: " tip-bot2 for Peter Zijlstra
2020-05-05 13:41 ` [patch V4 part 2 13/18] x86/kvm: Move context tracking where it belongs Thomas Gleixner
2020-05-06  7:42   ` Paolo Bonzini
2020-05-09  0:14   ` Andy Lutomirski
2020-05-09 10:12     ` Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 14/18] x86/kvm/vmx: Add hardirq tracing to guest enter/exit Thomas Gleixner
2020-05-06  7:55   ` Paolo Bonzini
2020-05-05 13:41 ` [patch V4 part 2 15/18] x86/kvm/svm: Handle hardirqs proper on " Thomas Gleixner
2020-05-06  8:15   ` Paolo Bonzini
2020-05-06  8:48     ` Thomas Gleixner
2020-05-06  9:21       ` Paolo Bonzini
2020-05-07 14:44         ` [patch V5 " Thomas Gleixner
2020-05-08 13:45           ` Paolo Bonzini
2020-05-08 14:01             ` Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 16/18] context_tracking: Make guest_enter/exit() .noinstr ready Thomas Gleixner
2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-05 13:41 ` [patch V4 part 2 17/18] x86/kvm/vmx: Move guest enter/exit into .noinstr.text Thomas Gleixner
2020-05-06  8:17   ` Paolo Bonzini
2020-05-05 13:41 ` [patch V4 part 2 18/18] x86/kvm/svm: " Thomas Gleixner
2020-05-06  8:17   ` Paolo Bonzini
2020-05-07 14:47   ` Alexandre Chartre
