linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements
@ 2021-03-17 18:12 Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

I noticed a little bug in fast compat syscalls.  I got a bit carried away
fixing it.  This renames the irqentry stuff to kentry, improves (IMNSHO)
the API, and adds lots of debugging.

It also tweaks the unwinder wrt ret_from_fork and rewrites ret_from_fork
in C.  I did this because the kentry work involved a small change to
ret_from_fork, and adjusting the asm is a mess.  So C it is.

Changes from v3: Get rid of arm64 special cases

Changes from v1 and v2: Complete rewrite

Andy Lutomirski (9):
  x86/dumpstack: Remove unnecessary range check fetching opcode bytes
  x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  x86/entry: Convert ret_from_fork to C
  kentry: Simplify the common syscall API
  kentry: Remove enter_from/exit_to_user_mode()
  entry: Make CONFIG_DEBUG_ENTRY available outside x86
  kentry: Add debugging checks for proper kentry API usage
  kentry: Check that syscall entries and syscall exits match
  kentry: Verify kentry state in instrumentation_begin/end()

 arch/x86/Kconfig.debug           |  10 --
 arch/x86/entry/common.c          |  78 ++++++++++----
 arch/x86/entry/entry_32.S        |  51 ++-------
 arch/x86/entry/entry_64.S        |  33 ++----
 arch/x86/include/asm/switch_to.h |   2 +-
 arch/x86/kernel/dumpstack.c      |  10 +-
 arch/x86/kernel/process.c        |  15 ++-
 arch/x86/kernel/process_32.c     |   2 +-
 arch/x86/kernel/traps.c          |   4 +-
 arch/x86/kernel/unwind_orc.c     |   2 +-
 include/asm-generic/bug.h        |   8 +-
 include/linux/entry-common.h     | 127 +++--------------------
 include/linux/instrumentation.h  |  25 ++++-
 include/linux/sched.h            |   4 +
 init/init_task.c                 |   8 ++
 kernel/entry/common.c            | 173 ++++++++++++++++++++-----------
 lib/Kconfig.debug                |  11 ++
 17 files changed, 267 insertions(+), 296 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

copy_from_user_nmi() validates that the pointer is in the user range,
so there is no need for an extra check in copy_code().
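
For reference, copy_from_user_nmi() already performs (roughly) this
range check internally before touching the pointer; the following is an
illustrative sketch, not the literal source:

	unsigned long
	copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
	{
		/* Reject ranges that reach into kernel space. */
		if (__range_not_ok(from, n, TASK_SIZE))
			return n;	/* i.e. "copied nothing" */
		/* ... pagefault-safe copy ... */
	}

so the __chk_range_not_ok() call in copy_code() was pure duplication.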

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 299c20f0a38b..55cf3c8325c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -81,12 +81,6 @@ static int copy_code(struct pt_regs *regs, u8 *buf, unsigned long src,
 	/* The user space code from other tasks cannot be accessed. */
 	if (regs != task_pt_regs(current))
 		return -EPERM;
-	/*
-	 * Make sure userspace isn't trying to trick us into dumping kernel
-	 * memory by pointing the userspace instruction pointer at it.
-	 */
-	if (__chk_range_not_ok(src, nbytes, TASK_SIZE_MAX))
-		return -EINVAL;
 
 	/*
 	 * Even if named copy_from_user_nmi() this can be invoked from
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-18  0:36   ` Josh Poimboeuf
  2021-03-17 18:12 ` [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C Andy Lutomirski
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski, Josh Poimboeuf

For kernel threads, task_pt_regs is currently all zeros, a valid user state
(if kernel_execve() has been called), or some combination thereof during
execution of kernel_execve().  If a stack trace is printed, the unwinder
might get confused and treat task_pt_regs as a kernel state.  Indeed,
forcing a stack dump results in an attempt to display _kernel_ code bytes
from a bogus address at the very top of kernel memory.

Fix the confusion by setting cs=3 so that user_mode(task_pt_regs(...)) ==
true for kernel threads.

Also improve the error when failing to fetch code bytes to show which type
of fetch failed.  This makes it much easier to understand what's happening.
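
For context, user_mode() keys off nothing but the RPL bits of CS; a
trimmed sketch of the x86 helper (64-bit branch shown, from memory):

	static __always_inline int user_mode(struct pt_regs *regs)
	{
		/* RPL != 0 means "not kernel" as far as we care. */
		return !!(regs->cs & 3);
	}

so cs = 3 is enough to steer user_mode() and the unwinder, even though
the rest of the register state is obviously not a real user context.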

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c |  4 ++--
 arch/x86/kernel/process.c   | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 55cf3c8325c6..9b7b69bb12e5 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -128,8 +128,8 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
 		/* No access to the user space stack of other tasks. Ignore. */
 		break;
 	default:
-		printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
-		       loglvl, prologue);
+		printk("%sCode: Unable to access %s opcode bytes at RIP 0x%lx.\n",
+		       loglvl, user_mode(regs) ? "user" : "kernel", prologue);
 		break;
 	}
 }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 145a7ac0c19a..f6f16df04cb9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		memset(childregs, 0, sizeof(struct pt_regs));
+
+		/*
+		 * Even though there is no real user state here, these
+		 * are were user regs belong, and kernel_execve() will
+		 * overwrite them with user regs.  Put an obviously
+		 * invalid value that nonetheless causes user_mode(regs)
+		 * to return true in CS.
+		 *
+		 * This also prevents the unwinder from thinking that there
+		 * is invalid kernel state at the top of the stack.
+		 */
+		childregs->cs = 3;
+
 		kthread_frame_init(frame, sp, arg);
 		return 0;
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-18  0:38   ` Josh Poimboeuf
  2021-03-19 16:05   ` Thomas Gleixner
  2021-03-17 18:12 ` [PATCH v4 4/9] kentry: Simplify the common syscall API Andy Lutomirski
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski, Josh Poimboeuf

ret_from_fork is written in asm, slightly differently, for x86_32 and
x86_64.  Convert it to C.

This is a straight conversion without any particular cleverness.  As a
further cleanup, the code that sets up the ret_from_fork argument registers
could be adjusted to put the arguments in the correct registers.

This will cause the ORC unwinder to find pt_regs even for kernel threads on
x86_64.  This seems harmless.

The 32-bit comment above the now-deleted schedule_tail_wrapper was
obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
same problem more cleanly.
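
For reference, the values that asm_ret_from_fork shuffles into the C
argument registers are seeded by copy_thread() via kthread_frame_init();
a from-memory sketch of that helper, trimmed:

	static inline void
	kthread_frame_init(struct inactive_task_frame *frame,
			   unsigned long fun, unsigned long arg)
	{
		frame->bx = fun;	/* rbx/ebx: kernel thread fn */
	#ifdef CONFIG_X86_32
		frame->di = arg;	/* edi: kernel thread arg */
	#else
		frame->r12 = arg;	/* r12: kernel thread arg */
	#endif
	}

__switch_to_asm restores those callee-saved registers and "returns" into
asm_ret_from_fork, which just moves them into the argument registers for
the C ret_from_fork().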

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c          | 23 ++++++++++++++
 arch/x86/entry/entry_32.S        | 51 +++++---------------------------
 arch/x86/entry/entry_64.S        | 33 +++++----------------
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/kernel/process.c        |  2 +-
 arch/x86/kernel/process_32.c     |  2 +-
 arch/x86/kernel/unwind_orc.c     |  2 +-
 7 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 95776f16c1cb..ef1c65938a6b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+void ret_from_fork(struct task_struct *prev,
+		   int (*kernel_thread_fn)(void *),
+		   void *kernel_thread_arg,
+		   struct pt_regs *user_regs);
+
+__visible void noinstr ret_from_fork(struct task_struct *prev,
+				     int (*kernel_thread_fn)(void *),
+				     void *kernel_thread_arg,
+				     struct pt_regs *user_regs)
+{
+	instrumentation_begin();
+
+	schedule_tail(prev);
+
+	if (kernel_thread_fn) {
+		kernel_thread_fn(kernel_thread_arg);
+		user_regs->ax = 0;
+	}
+
+	instrumentation_end();
+	syscall_exit_to_user_mode(user_regs);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index df8c017e6161..7113d259727f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -805,26 +805,6 @@ SYM_CODE_START(__switch_to_asm)
 SYM_CODE_END(__switch_to_asm)
 .popsection
 
-/*
- * The unwinder expects the last frame on the stack to always be at the same
- * offset from the end of the page, which allows it to validate the stack.
- * Calling schedule_tail() directly would break that convention because its an
- * asmlinkage function so its argument has to be pushed on the stack.  This
- * wrapper creates a proper "end of stack" frame header before the call.
- */
-.pushsection .text, "ax"
-SYM_FUNC_START(schedule_tail_wrapper)
-	FRAME_BEGIN
-
-	pushl	%eax
-	call	schedule_tail
-	popl	%eax
-
-	FRAME_END
-	ret
-SYM_FUNC_END(schedule_tail_wrapper)
-.popsection
-
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -833,29 +813,14 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * edi: kernel thread arg
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-	call	schedule_tail_wrapper
-
-	testl	%ebx, %ebx
-	jnz	1f		/* kernel threads are uncommon */
-
-2:
-	/* When we fork, we trace the syscall return in the child, too. */
-	movl    %esp, %eax
-	call    syscall_exit_to_user_mode
-	jmp     .Lsyscall_32_done
-
-	/* kernel thread */
-1:	movl	%edi, %eax
-	CALL_NOSPEC ebx
-	/*
-	 * A kernel thread is allowed to return here after successfully
-	 * calling kernel_execve().  Exit to userspace to complete the execve()
-	 * syscall.
-	 */
-	movl	$0, PT_EAX(%esp)
-	jmp	2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_START(asm_ret_from_fork)
+	movl	%ebx, %edx
+	movl	%edi, %ecx
+	pushl	%esp
+	call	ret_from_fork
+	addl	$4, %esp
+	jmp	.Lsyscall_32_done
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..0f7df8861ac1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -273,35 +273,18 @@ SYM_FUNC_END(__switch_to_asm)
  * rax: prev task we switched from
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
+ * rbp: encoded frame pointer for the fp unwinder
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-	UNWIND_HINT_EMPTY
-	movq	%rax, %rdi
-	call	schedule_tail			/* rdi: 'prev' task parameter */
-
-	testq	%rbx, %rbx			/* from kernel_thread? */
-	jnz	1f				/* kernel threads are uncommon */
-
-2:
+SYM_CODE_START(asm_ret_from_fork)
 	UNWIND_HINT_REGS
-	movq	%rsp, %rdi
-	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
+	movq	%rax, %rdi
+	movq	%rbx, %rsi
+	movq	%r12, %rdx
+	movq	%rsp, %rcx
+	call	ret_from_fork
 	jmp	swapgs_restore_regs_and_return_to_usermode
-
-1:
-	/* kernel thread */
-	UNWIND_HINT_EMPTY
-	movq	%r12, %rdi
-	CALL_NOSPEC rbx
-	/*
-	 * A kernel thread is allowed to return here after successfully
-	 * calling kernel_execve().  Exit to userspace to complete the execve()
-	 * syscall.
-	 */
-	movq	$0, RAX(%rsp)
-	jmp	2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 .macro DEBUG_ENTRY_ASSERT_IRQS_OFF
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 9f69cc497f4b..fcb9b02a1269 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -12,7 +12,7 @@ struct task_struct *__switch_to_asm(struct task_struct *prev,
 __visible struct task_struct *__switch_to(struct task_struct *prev,
 					  struct task_struct *next);
 
-asmlinkage void ret_from_fork(void);
+asmlinkage void asm_ret_from_fork(void);
 
 /*
  * This is the structure pointed to by thread.sp for an inactive task.  The
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f6f16df04cb9..34efbca08738 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -135,7 +135,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	frame = &fork_frame->frame;
 
 	frame->bp = encode_frame_pointer(childregs);
-	frame->ret_addr = (unsigned long) ret_from_fork;
+	frame->ret_addr = (unsigned long) asm_ret_from_fork;
 	p->thread.sp = (unsigned long) fork_frame;
 	p->thread.io_bitmap = NULL;
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..bf8aa15ac652 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -151,7 +151,7 @@ EXPORT_SYMBOL_GPL(start_thread);
  * more flexibility.
  *
  * The return value (in %ax) will be the "prev" task after
- * the task-switch, and shows up in ret_from_fork in entry.S,
+ * the task-switch, and shows up in asm_ret_from_fork in entry_32.S,
  * for example.
  */
 __visible __notrace_funcgraph struct task_struct *
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 73f800100066..c6e7235c6d9f 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -659,7 +659,7 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		state->sp = task->thread.sp + sizeof(*frame);
 		state->bp = READ_ONCE_NOCHECK(frame->bp);
 		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
-		state->signal = (void *)state->ip == ret_from_fork;
+		state->signal = (void *)state->ip == asm_ret_from_fork;
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/9] kentry: Simplify the common syscall API
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (2 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-19 16:09   ` Thomas Gleixner
  2021-03-19 18:07   ` Thomas Gleixner
  2021-03-17 18:12 ` [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode() Andy Lutomirski
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

The new common syscall API had a large and confusing API surface.  Simplify
it.  Now there is exactly one way to use it.  It's a bit more verbose than
the old way for the simple x86_64 native case, but it's much easier to use
right, and the diffstat should speak for itself.
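
With this patch, every syscall path follows a single shape; condensed
from the do_syscall_64() hunk below:

	kentry_enter_from_user_mode(regs);
	local_irq_enable();
	instrumentation_begin();

	nr = kentry_syscall_begin(regs, nr);
	/* ... dispatch the syscall ... */
	kentry_syscall_end(regs);

	local_irq_disable();
	instrumentation_end();
	kentry_exit_to_user_mode(regs);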

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c      | 57 +++++++++++++++---------
 include/linux/entry-common.h | 86 ++++++------------------------------
 kernel/entry/common.c        | 57 +++---------------------
 3 files changed, 54 insertions(+), 146 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ef1c65938a6b..8710b2300b8d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,9 +38,12 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-	nr = syscall_enter_from_user_mode(regs, nr);
-
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
 	instrumentation_begin();
+
+	nr = kentry_syscall_begin(regs, nr);
+
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
 		regs->ax = sys_call_table[nr](regs);
@@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
+
+	kentry_syscall_end(regs);
+
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 #endif
 
@@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
 
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
+	instrumentation_begin();
+
 	/*
 	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 	 * orig_ax, the unsigned int return value truncates it.  This may
 	 * or may not be necessary, but it matches the old asm behavior.
 	 */
-	nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
-	instrumentation_begin();
-
+	nr = (unsigned int)kentry_syscall_begin(regs, nr);
 	do_syscall_32_irqs_on(regs, nr);
+	kentry_syscall_end(regs);
 
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
+	bool ret;
 	int res;
 
-	/*
-	 * This cannot use syscall_enter_from_user_mode() as it has to
-	 * fetch EBP before invoking any of the syscall entry work
-	 * functions.
-	 */
-	syscall_enter_from_user_mode_prepare(regs);
-
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
 	instrumentation_begin();
+
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*
@@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-
-		instrumentation_end();
-		local_irq_disable();
-		irqentry_exit_to_user_mode(regs);
-		return false;
+		ret = false;
+		goto out;
 	}
 
 	/* The case truncates any ptrace induced syscall nr > 2^32 -1 */
-	nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+	nr = (unsigned int)kentry_syscall_begin(regs, nr);
 
 	/* Now this is just like a normal syscall. */
 	do_syscall_32_irqs_on(regs, nr);
 
+	kentry_syscall_end(regs);
+	ret = true;
+
+out:
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
-	return true;
+	return ret;
 }
 
@@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct *prev,
 		user_regs->ax = 0;
 	}
 
+	kentry_syscall_end(user_regs);
+
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(user_regs);
+	kentry_exit_to_user_mode(user_regs);
 }
 
 #ifdef CONFIG_XEN_PV
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fd2d7c35670a..5287c6c15a66 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -119,31 +119,12 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 void enter_from_user_mode(struct pt_regs *regs);
 
 /**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs:	Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
-/**
- * syscall_enter_from_user_mode_work - Check and handle work before invoking
- *				       a syscall
+ * kentry_syscall_begin - Prepare to invoke a syscall handler
  * @regs:	Pointer to currents pt_regs
  * @syscall:	The syscall number
  *
  * Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after kentry_enter_from_usermode or a similar function.
  *
  * Returns: The original or a modified syscall number
  *
@@ -152,32 +133,16 @@ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
  * syscall_set_return_value() first.  If neither of those are called and -1
  * is returned, then the syscall will fail with ENOSYS.
  *
+ * After calling kentry_syscall_begin(), regardless of the return value,
+ * the caller must call kentry_syscall_end().
+ *
  * It handles the following work items:
  *
  *  1) syscall_work flag dependent invocations of
  *     arch_syscall_enter_tracehook(), __secure_computing(), trace_sys_enter()
  *  2) Invocation of audit_syscall_entry()
  */
-long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
-
-/**
- * syscall_enter_from_user_mode - Establish state and check and handle work
- *				  before invoking a syscall
- * @regs:	Pointer to currents pt_regs
- * @syscall:	The syscall number
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
- *
- * Returns: The original or a modified syscall number. See
- * syscall_enter_from_user_mode_work() for further explanation.
- */
-long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+long kentry_syscall_begin(struct pt_regs *regs, long syscall);
 
 /**
  * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
@@ -317,28 +282,16 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 void exit_to_user_mode(void);
 
 /**
- * syscall_exit_to_user_mode_work - Handle work before returning to user mode
+ * kentry_syscall_end - Finish syscall processing
  * @regs:	Pointer to currents pt_regs
  *
- * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling
- * exit_to_user_mode() to perform the final transition to user mode.
  *
- * Calling convention is the same as for syscall_exit_to_user_mode() and it
- * returns with all work handled and interrupts disabled. The caller must
- * invoke exit_to_user_mode() before actually switching to user mode to
- * make the final state transitions. Interrupts must stay disabled between
- * return from this function and the invocation of exit_to_user_mode().
- */
-void syscall_exit_to_user_mode_work(struct pt_regs *regs);
-
-/**
- * syscall_exit_to_user_mode - Handle work before returning to user mode
- * @regs:	Pointer to currents pt_regs
+ * This must be called after arch code calls kentry_syscall_begin()
+ * and invokes the syscall handler, if any.  This must also be called when
+ * returning from fork() to user mode, since return-from-fork is considered
+ * to be a syscall return.
  *
- * Invoked with interrupts enabled and fully valid regs. Returns with all
- * work handled, interrupts disabled such that the caller can immediately
- * switch to user mode. Called from architecture specific syscall and ret
- * from fork code.
+ * Called with IRQs on.  Returns with IRQs still on.
  *
  * The call order is:
  *  1) One-time syscall exit work:
@@ -346,21 +299,8 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
  *      - audit
  *	- syscall tracing
  *	- tracehook (single stepping)
- *
- *  2) Preparatory work
- *	- Exit to user mode loop (common TIF handling). Invokes
- *	  arch_exit_to_user_mode_work() for architecture specific TIF work
- *	- Architecture specific one time work arch_exit_to_user_mode_prepare()
- *	- Address limit and lockdep checks
- *
- *  3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
- *     functionality in exit_to_user_mode().
- *
- * This is a combination of syscall_exit_to_user_mode_work() (1,2) and
- * exit_to_user_mode(). This function is preferred unless there is a
- * compelling architectural reason to use the seperate functions.
  */
-void syscall_exit_to_user_mode(struct pt_regs *regs);
+void kentry_syscall_end(struct pt_regs *regs);
 
 /**
  * kentry_enter_from_user_mode - Establish state before invoking the irq handler
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 269766a8f981..800ad406431b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -80,44 +80,19 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 	return ret ? : syscall;
 }
 
-static __always_inline long
-__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
+long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
 	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+	lockdep_assert_irqs_enabled();
+
 	if (work & SYSCALL_WORK_ENTER)
 		syscall = syscall_trace_enter(regs, syscall, work);
 
 	return syscall;
 }
 
-long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
-{
-	return __syscall_enter_from_user_work(regs, syscall);
-}
-
-noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
-{
-	long ret;
-
-	__enter_from_user_mode(regs);
-
-	instrumentation_begin();
-	local_irq_enable();
-	ret = __syscall_enter_from_user_work(regs, syscall);
-	instrumentation_end();
-
-	return ret;
-}
-
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
-	__enter_from_user_mode(regs);
-	instrumentation_begin();
-	local_irq_enable();
-	instrumentation_end();
-}
-
 /* See comment for exit_to_user_mode() in entry-common.h */
 static __always_inline void __exit_to_user_mode(void)
 {
@@ -218,7 +193,7 @@ static inline bool report_single_step(unsigned long work)
 /*
  * If SYSCALL_EMU is set, then the only reason to report is when
  * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
- * instruction has been already reported in syscall_enter_from_user_mode().
+ * instruction has been already reported in kentry_syscall_begin().
  */
 static inline bool report_single_step(unsigned long work)
 {
@@ -261,7 +236,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
  * Syscall specific exit to user mode preparation. Runs with interrupts
  * enabled.
  */
-static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+void kentry_syscall_end(struct pt_regs *regs)
 {
 	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 	unsigned long nr = syscall_get_nr(current, regs);
@@ -284,26 +259,6 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 		syscall_exit_work(regs, work);
 }
 
-static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
-{
-	syscall_exit_to_user_mode_prepare(regs);
-	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
-}
-
-void syscall_exit_to_user_mode_work(struct pt_regs *regs)
-{
-	__syscall_exit_to_user_mode_work(regs);
-}
-
-__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__syscall_exit_to_user_mode_work(regs);
-	instrumentation_end();
-	__exit_to_user_mode();
-}
-
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	__enter_from_user_mode(regs);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode()
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (3 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 4/9] kentry: Simplify the common syscall API Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-19 18:08   ` Thomas Gleixner
  2021-03-17 18:12 ` [PATCH v4 6/9] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

enter_from_user_mode() and exit_to_user_mode() do part, but not all, of the
entry and exit work.  Fortunately, they have no callers, so there's no need
to pin down their precise semantics.  Delete them.

This doesn't affect arm64.  arm64 has identically named functions
in arch/arm64, and kentry isn't built on arm64.  So this patch also
reduces confusion for people who grep the tree and might previously
have thought that arm64 used this code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/entry-common.h | 41 ------------------------------------
 kernel/entry/common.c        | 10 ---------
 2 files changed, 51 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 5287c6c15a66..ba4e8509a20f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -97,27 +97,6 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
-/**
- * enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and interrupts are still
- * disabled. The subsequent functions can be instrumented.
- *
- * This is invoked when there is architecture specific functionality to be
- * done between establishing state and enabling interrupts. The caller must
- * enable interrupts before invoking syscall_enter_from_user_mode_work().
- */
-void enter_from_user_mode(struct pt_regs *regs);
-
 /**
  * kentry_syscall_begin - Prepare to invoke a syscall handler
  * @regs:	Pointer to currents pt_regs
@@ -261,26 +240,6 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 }
 #endif
 
-/**
- * exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall/interrupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *    mitigations, etc.: arch_exit_to_user_mode()
- * 4) Tell lockdep that interrupts are enabled
- *
- * Invoked from architecture specific code when syscall_exit_to_user_mode()
- * is not suitable as the last step before returning to userspace. Must be
- * invoked with interrupts disabled and the caller must be
- * non-instrumentable.
- * The caller has to invoke syscall_exit_to_user_mode_work() before this.
- */
-void exit_to_user_mode(void);
-
 /**
  * kentry_syscall_end - Finish syscall processing
  * @regs:	Pointer to currents pt_regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 800ad406431b..6fbe1c109877 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -25,11 +25,6 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	instrumentation_end();
 }
 
-void noinstr enter_from_user_mode(struct pt_regs *regs)
-{
-	__enter_from_user_mode(regs);
-}
-
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
 	if (unlikely(audit_context())) {
@@ -106,11 +101,6 @@ static __always_inline void __exit_to_user_mode(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-void noinstr exit_to_user_mode(void)
-{
-	__exit_to_user_mode();
-}
-
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 6/9] entry: Make CONFIG_DEBUG_ENTRY available outside x86
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (4 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode() Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

In principle, the generic entry code is generic, and the goal is to use it
in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
to the generic config so that it can be used in the generic entry code and
not just in arch/x86.

This currently depends on GENERIC_ENTRY.  If an architecture wants to use
DEBUG_ENTRY without using GENERIC_ENTRY, this could be changed.
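
An architecture opts in the same way it opts in to the generic entry
code itself; an illustrative (hypothetical) Kconfig fragment:

	config MYARCH
		...
		select GENERIC_ENTRY

at which point DEBUG_ENTRY from lib/Kconfig.debug becomes selectable.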

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig.debug | 10 ----------
 lib/Kconfig.debug      | 11 +++++++++++
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7f4947..a5a52133730c 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -170,16 +170,6 @@ config CPA_DEBUG
 	help
 	  Do change_page_attr() self-tests every 30 seconds.
 
-config DEBUG_ENTRY
-	bool "Debug low-level entry code"
-	depends on DEBUG_KERNEL
-	help
-	  This option enables sanity checks in x86's low-level entry code.
-	  Some of these sanity checks may slow down kernel entries and
-	  exits or otherwise impact performance.
-
-	  If unsure, say N.
-
 config DEBUG_NMI_SELFTEST
 	bool "NMI Selftest"
 	depends on DEBUG_KERNEL && X86_LOCAL_APIC
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..927d913fd471 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
 
 endmenu # lock debugging
 
+config DEBUG_ENTRY
+	bool "Debug low-level entry code"
+	depends on DEBUG_KERNEL
+	depends on GENERIC_ENTRY
+	help
+	  This option enables sanity checks in the low-level entry code.
+	  Some of these sanity checks may slow down kernel entries and
+	  exits or otherwise impact performance.
+
+	  If unsure, say N.
+
 config TRACE_IRQFLAGS
 	depends on TRACE_IRQFLAGS_SUPPORT
 	bool
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (5 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 6/9] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-19 16:17   ` Thomas Gleixner
  2021-03-17 18:12 ` [PATCH v4 8/9] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 9/9] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
  8 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

It's quite easy to mess up kentry calls.  Add debugging checks that kentry
transitions to and from user mode match up and that kentry_nmi_enter() and
kentry_nmi_exit() match up.

Checking that kentry_enter() calls fully match up with their
kentry_exit() counterparts would need per-task state, so that is not
done here.
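
The per-CPU depth accounting then looks like this (illustrative values
using the constants from this patch):

	user mode				0
	syscall/IRQ entered from user mode	1
	NMI-class entry on top of that		1 + 1024
	nested NMI-class entry (if it nests)	1 + 2 * 1024

The large NMI stride keeps the two kinds of nesting from aliasing each
other in a single counter.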

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/entry/common.c | 81 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fbe1c109877..8ba774184e00 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,9 +11,65 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
+/*
+ * kentry_cpu_depth is 0 in user mode, 1 in normal kernel mode, and
+ * 1 + n * KENTRY_CPU_DEPTH_NMI in kentry_nmi_enter() mode.  We can't
+ * use a percpu variable to match up kentry_enter() from kernel mode
+ * with the corresponding kentry_exit() because tasks may schedule inside
+ * nested kentry_enter() regions.
+ */
+#define KENTRY_CPU_DEPTH_NMI	1024UL
+
+#ifdef CONFIG_DEBUG_ENTRY
+
+/*
+ * Extra safe WARN_ONCE.  Per-arch optimized WARN_ONCE() implementations
+ * might go through the low-level entry and kentry code even before noticing
+ * that the warning already fired, which could result in recursive warnings.
+ * This carefully avoids any funny business once a given warning has fired.
+ */
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) ({		\
+	static bool __section(".data.once") __warned;		\
+	int __ret_warn_once = !!(condition);			\
+								\
+	if (unlikely(__ret_warn_once && !READ_ONCE(__warned))) {\
+		WRITE_ONCE(__warned, true);			\
+		WARN(1, format);				\
+	}							\
+	unlikely(__ret_warn_once);				\
+})
+
+
+static DEFINE_PER_CPU(unsigned int, kentry_cpu_depth) = 1UL;
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n)
+{
+	this_cpu_add(kentry_cpu_depth, n);
+}
+
+static void kentry_cpu_depth_check(unsigned int n)
+{
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) < n, "kentry depth underflow");
+}
+
+static __always_inline void kentry_cpu_depth_sub(unsigned int n)
+{
+	this_cpu_sub(kentry_cpu_depth, n);
+}
+#else
+
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
+static void kentry_cpu_depth_check(unsigned int n) {}
+static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
+
+#endif
+
 /* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
+	kentry_cpu_depth_add(1);
 	arch_check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
@@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(
+		this_cpu_read(kentry_cpu_depth) != 1,
+		"kentry: __enter_from_user_mode() called while kentry thought the CPU was in the kernel (%u)",
+		this_cpu_read(kentry_cpu_depth));
+#endif
+
 	instrumentation_end();
 }
 
@@ -92,6 +156,11 @@ long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
+			      "__exit_to_user_mode called at wrong kentry cpu depth (%u)",
+			      this_cpu_read(kentry_cpu_depth));
+#endif
 	trace_hardirqs_on_prepare();
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	instrumentation_end();
@@ -99,6 +168,7 @@ static __always_inline void __exit_to_user_mode(void)
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
+	kentry_cpu_depth_sub(1);
 }
 
 /* Workaround to allow gradual conversion of architecture code */
@@ -346,7 +416,12 @@ noinstr void kentry_exit(struct pt_regs *regs, kentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		kentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
+		return;
+	}
+
+	kentry_cpu_depth_check(1);
+
+	if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -385,6 +460,8 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
 
 	irq_state.lockdep = lockdep_hardirqs_enabled();
 
+	kentry_cpu_depth_add(KENTRY_CPU_DEPTH_NMI);
+
 	__nmi_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	lockdep_hardirq_enter();
@@ -401,6 +478,7 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
 void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
 {
 	instrumentation_begin();
+	kentry_cpu_depth_check(KENTRY_CPU_DEPTH_NMI);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();
@@ -413,4 +491,5 @@ void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
 	if (irq_state.lockdep)
 		lockdep_hardirqs_on(CALLER_ADDR0);
 	__nmi_exit();
+	kentry_cpu_depth_sub(KENTRY_CPU_DEPTH_NMI);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 8/9] kentry: Check that syscall entries and syscall exits match
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (6 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  2021-03-17 18:12 ` [PATCH v4 9/9] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
  8 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

If arch code calls the wrong kernel entry helpers, syscall entries and
exits can get out of sync.  Add a new field to task_struct to track the
syscall state and validate that it transitions correctly.
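
The intended transitions, as implemented below, are:

	kentry_syscall_begin():	warn if kentry_in_syscall; set it
	kentry_syscall_end():	warn if !kentry_in_syscall; clear it
	__exit_to_user_mode():	warn if kentry_in_syscall is still set

Kernel threads (and init_task) start with the flag set so that
kernel_execve() can "return" from the pretend syscall on the way to
user mode.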

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched.h |  4 ++++
 init/init_task.c      |  8 ++++++++
 kernel/entry/common.c | 24 +++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..95d6d8686d98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1368,6 +1368,10 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+#ifdef CONFIG_DEBUG_ENTRY
+	bool kentry_in_syscall;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 8a992d73e6fb..de4fdaac4e8b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -212,6 +212,14 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEBUG_ENTRY
+	/*
+	 * The init task, and kernel threads in general, are considered
+	 * to be "in a syscall".  This way they can execve() and then exit
+	 * the supposed syscall that they were in to go to user mode.
+	 */
+	.kentry_in_syscall = true,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8ba774184e00..152e7546be16 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -141,11 +141,21 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 
 long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
-	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+	unsigned long work;
+
+	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+		DEBUG_ENTRY_WARN_ONCE(
+			current->kentry_in_syscall,
+			"entering syscall %ld while already in a syscall",
+			syscall);
+		current->kentry_in_syscall = true;
+	}
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	lockdep_assert_irqs_enabled();
 
+	work = READ_ONCE(current_thread_info()->syscall_work);
+
 	if (work & SYSCALL_WORK_ENTER)
 		syscall = syscall_trace_enter(regs, syscall, work);
 
@@ -156,11 +166,16 @@ long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
+
 #ifdef CONFIG_DEBUG_ENTRY
 	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
 			      "__exit_to_user_mode called at wrong kentry cpu depth (%u)",
 			      this_cpu_read(kentry_cpu_depth));
+
+	DEBUG_ENTRY_WARN_ONCE(current->kentry_in_syscall,
+			      "exiting to user mode while in syscall context");
 #endif
+
 	trace_hardirqs_on_prepare();
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	instrumentation_end();
@@ -317,6 +332,13 @@ void kentry_syscall_end(struct pt_regs *regs)
 	 */
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
+
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(!current->kentry_in_syscall,
+			      "exiting syscall %lu without entering first", nr);
+
+	current->kentry_in_syscall = false;
+#endif
 }
 
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 9/9] kentry: Verify kentry state in instrumentation_begin/end()
  2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (7 preceding siblings ...)
  2021-03-17 18:12 ` [PATCH v4 8/9] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
@ 2021-03-17 18:12 ` Andy Lutomirski
  8 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2021-03-17 18:12 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

Calling instrumentation_begin() and instrumentation_end() when kentry
thinks the CPU is in user mode is an error.  Verify the kentry state when
instrumentation_begin/end() are called.

Add _nocheck() variants that skip the verification so that WARN() itself
cannot recursively generate further kentry warnings.
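
Without the _nocheck() variants a failed check could recurse; a sketch
of the hazard:

	instrumentation_begin()
	  -> kentry_assert_may_instrument()
	    -> DEBUG_ENTRY_WARN_ONCE() -> WARN()
	      -> __WARN_printf() -> instrumentation_begin() -> ...

so the instrumentation markers inside WARN() use the unchecked variants.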

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c         |  4 ++--
 include/asm-generic/bug.h       |  8 ++++----
 include/linux/instrumentation.h | 25 ++++++++++++++++++++-----
 kernel/entry/common.c           |  7 +++++++
 4 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index be924180005a..983e4be5fdcb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -229,7 +229,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 	/*
 	 * All lies, just get the WARN/BUG out.
 	 */
-	instrumentation_begin();
+	instrumentation_begin_nocheck();
 	/*
 	 * Since we're emulating a CALL with exceptions, restore the interrupt
 	 * state to what it was at the exception site.
@@ -242,7 +242,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 	}
 	if (regs->flags & X86_EFLAGS_IF)
 		raw_local_irq_disable();
-	instrumentation_end();
+	instrumentation_end_nocheck();
 
 	return handled;
 }
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 76a10e0dca9f..fc360c463a99 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -85,18 +85,18 @@ void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
 		       const char *fmt, ...);
 #define __WARN()		__WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {				\
-		instrumentation_begin();				\
+		instrumentation_begin_nocheck();			\
 		warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);	\
-		instrumentation_end();					\
+		instrumentation_end_nocheck();				\
 	} while (0)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()		__WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {				\
-		instrumentation_begin();				\
+		instrumentation_begin_nocheck();			\
 		__warn_printk(arg);					\
 		__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
-		instrumentation_end();					\
+		instrumentation_end_nocheck();				\
 	} while (0)
 #define WARN_ON_ONCE(condition) ({				\
 	int __ret_warn_on = !!(condition);			\
diff --git a/include/linux/instrumentation.h b/include/linux/instrumentation.h
index 93e2ad67fc10..cdf80454f92a 100644
--- a/include/linux/instrumentation.h
+++ b/include/linux/instrumentation.h
@@ -4,14 +4,21 @@
 
 #if defined(CONFIG_DEBUG_ENTRY) && defined(CONFIG_STACK_VALIDATION)
 
+extern void kentry_assert_may_instrument(void);
+
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({					\
-	asm volatile("%c0: nop\n\t"						\
+#define instrumentation_begin_nocheck() ({				\
+	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_begin\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
 
+#define instrumentation_begin() ({					\
+	instrumentation_begin_nocheck();				\
+	kentry_assert_may_instrument();					\
+})
+
 /*
  * Because instrumentation_{begin,end}() can nest, objtool validation considers
  * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
@@ -43,15 +50,23 @@
  * To avoid this, have _end() be a NOP instruction, this ensures it will be
  * part of the condition block and does not escape.
  */
-#define instrumentation_end() ({					\
+#define instrumentation_end_nocheck() ({				\
 	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_end\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
+
+#define instrumentation_end() ({					\
+	kentry_assert_may_instrument();					\
+	instrumentation_end_nocheck();					\
+})
+
 #else
-# define instrumentation_begin()	do { } while(0)
-# define instrumentation_end()		do { } while(0)
+# define instrumentation_begin_nocheck()	do { } while(0)
+# define instrumentation_begin()		do { } while(0)
+# define instrumentation_end_nocheck()		do { } while(0)
+# define instrumentation_end()			do { } while(0)
 #endif
 
 #endif /* __LINUX_INSTRUMENTATION_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 152e7546be16..57036a887790 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -56,6 +56,13 @@ static __always_inline void kentry_cpu_depth_sub(unsigned int n)
 {
 	this_cpu_sub(kentry_cpu_depth, n);
 }
+
+void kentry_assert_may_instrument(void)
+{
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) == 0, "instrumentable code is running in the wrong kentry state");
+}
+EXPORT_SYMBOL_GPL(kentry_assert_may_instrument);
+
 #else
 
 #define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  2021-03-17 18:12 ` [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
@ 2021-03-18  0:36   ` Josh Poimboeuf
  0 siblings, 0 replies; 18+ messages in thread
From: Josh Poimboeuf @ 2021-03-18  0:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML, Mark Rutland, Brian Gerst

On Wed, Mar 17, 2021 at 11:12:41AM -0700, Andy Lutomirski wrote:
> @@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & PF_KTHREAD)) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
> +
> +		/*
> +		 * Even though there is no real user state here, these
> +		 * are were user regs belong, and kernel_execve() will

s/were/where/

> +		 * overwrite them with user regs.  Put an obviously
> +		 * invalid value that nonetheless causes user_mode(regs)
> +		 * to return true in CS.
> +		 *
> +		 * This also prevents the unwinder from thinking that there
> +		 * is invalid kernel state at the top of the stack.
> +		 */
> +		childregs->cs = 3;
> +
>  		kthread_frame_init(frame, sp, arg);
>  		return 0;
>  	}

Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C
  2021-03-17 18:12 ` [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C Andy Lutomirski
@ 2021-03-18  0:38   ` Josh Poimboeuf
  2021-03-19 16:05   ` Thomas Gleixner
  1 sibling, 0 replies; 18+ messages in thread
From: Josh Poimboeuf @ 2021-03-18  0:38 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML, Mark Rutland, Brian Gerst

On Wed, Mar 17, 2021 at 11:12:42AM -0700, Andy Lutomirski wrote:
> ret_from_fork is written in asm, slightly differently, for x86_32 and
> x86_64.  Convert it to C.
> 
> This is a straight conversion without any particular cleverness.  As a
> further cleanup, the code that sets up the ret_from_fork argument registers
> could be adjusted to put the arguments in the correct registers.
> 
> This will cause the ORC unwinder to find pt_regs even for kernel threads on
> x86_64.  This seems harmless.
> 
> The 32-bit comment above the now-deleted schedule_tail_wrapper was
> obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
> same problem more cleanly.
> 
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C
  2021-03-17 18:12 ` [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C Andy Lutomirski
  2021-03-18  0:38   ` Josh Poimboeuf
@ 2021-03-19 16:05   ` Thomas Gleixner
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 16:05 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski, Josh Poimboeuf

On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:

> ret_from_fork is written in asm, slightly differently, for x86_32 and
> x86_64.  Convert it to C.
> +__visible void noinstr ret_from_fork(struct task_struct *prev,
> +				     int (*kernel_thread_fn)(void *),
> +				     void *kernel_thread_arg,
> +				     struct pt_regs *user_regs)
> +{
> +	instrumentation_begin();
> +
> +	schedule_tail(prev);
> +
> +	if (kernel_thread_fn) {
> +		kernel_thread_fn(kernel_thread_arg);
> +		user_regs->ax = 0;

If you replace this with:

   syscall_set_return_value(current, user_regs, 0, 0);

then it's architecture agnostic and can move to kernel/entry, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 4/9] kentry: Simplify the common syscall API
  2021-03-17 18:12 ` [PATCH v4 4/9] kentry: Simplify the common syscall API Andy Lutomirski
@ 2021-03-19 16:09   ` Thomas Gleixner
  2021-03-19 18:07   ` Thomas Gleixner
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 16:09 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:
> The new common syscall API had a large and confusing API surface.  Simplify
> it.  Now there is exactly one way to use it.  It's a bit more verbose than
> the old way for the simple x86_64 native case, but it's much easier to use
> right,

and therefore much easier to get wrong...

>  __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
> -	nr = syscall_enter_from_user_mode(regs, nr);
> -
> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();

... That needs to be _after_ instrumentation_begin(). If you fiddle
with this then please make sure that objtool validates noinstr...
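
IOW, something like:

	kentry_enter_from_user_mode(regs);
	instrumentation_begin();
	local_irq_enable();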

> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();

Ditto

> +	instrumentation_begin();
>  static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>  {
>  	unsigned int nr = syscall_32_enter(regs);
> +	bool ret;
>  	int res;
>  
> -	/*
> -	 * This cannot use syscall_enter_from_user_mode() as it has to
> -	 * fetch EBP before invoking any of the syscall entry work
> -	 * functions.
> -	 */
> -	syscall_enter_from_user_mode_prepare(regs);
> -
> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();

...

>  	instrumentation_begin();

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread
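
Concretely, the ordering being asked for looks like this (a sketch based
on the quoted hunks, not the actual fixed-up patch). The reason is that
local_irq_enable() calls into lockdep/irq-tracing code, which is
instrumentable and therefore must not run before instrumentation_begin():

  __visible noinstr void do_syscall_64(unsigned long nr,
                                       struct pt_regs *regs)
  {
          kentry_enter_from_user_mode(regs);

          instrumentation_begin();
          /* Enable interrupts only inside the instrumentable section. */
          local_irq_enable();

          /* ... syscall dispatch and exit work as in the patch ... */
  }

This is also exactly the kind of ordering bug that objtool's noinstr
validation is meant to catch, hence the request to make sure it still
validates the modified code.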

* Re: [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage
  2021-03-17 18:12 ` [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
@ 2021-03-19 16:17   ` Thomas Gleixner
  2021-03-19 18:16     ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 16:17 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:
> +
> +#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)

So we have a stub for !DEBUG

> +static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
> +static void kentry_cpu_depth_check(unsigned int n) {}
> +static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
> +
> +#endif
> +
>  /* See comment for enter_from_user_mode() in entry-common.h */
>  static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>  {
> +	kentry_cpu_depth_add(1);
>  	arch_check_user_regs(regs);
>  	lockdep_hardirqs_off(CALLER_ADDR0);
>  
> @@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>  
>  	instrumentation_begin();
>  	trace_hardirqs_off_finish();
> +
> +#ifdef CONFIG_DEBUG_ENTRY

Why do we need that #ifdeffery all over the place?

> +	DEBUG_ENTRY_WARN_ONCE(
> +		this_cpu_read(kentry_cpu_depth) != 1,
> +		"kentry: __enter_from_user_mode() called while kentry thought the CPU was in the kernel (%u)",
> +		this_cpu_read(kentry_cpu_depth));

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread
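
One way to avoid the #ifdeffery at the call sites is to keep all
CONFIG_DEBUG_ENTRY knowledge inside always-defined helpers, so the
per-cpu variable is only referenced when it actually exists. A sketch
follows; the helper name kentry_check_cpu_depth_on_entry() is made up
for illustration:

  #ifdef CONFIG_DEBUG_ENTRY

  static __always_inline void kentry_check_cpu_depth_on_entry(void)
  {
          DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
                                "kentry: __enter_from_user_mode() called while kentry thought the CPU was in the kernel (%u)",
                                this_cpu_read(kentry_cpu_depth));
  }

  #else /* !CONFIG_DEBUG_ENTRY */

  static __always_inline void kentry_check_cpu_depth_on_entry(void) { }

  #endif

The call site in __enter_from_user_mode() then shrinks to a single
unconditional kentry_check_cpu_depth_on_entry(), with no guard around it.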

* Re: [PATCH v4 4/9] kentry: Simplify the common syscall API
  2021-03-17 18:12 ` [PATCH v4 4/9] kentry: Simplify the common syscall API Andy Lutomirski
  2021-03-19 16:09   ` Thomas Gleixner
@ 2021-03-19 18:07   ` Thomas Gleixner
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 18:07 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:
> @@ -119,31 +119,12 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
>  void enter_from_user_mode(struct pt_regs *regs);
>  
>  /**
> + * kentry_syscall_begin - Prepare to invoke a syscall handler
>   * @regs:	Pointer to currents pt_regs
>   * @syscall:	The syscall number
>   *
>   * Invoked from architecture specific syscall entry code with interrupts
> - * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
> - * architecture specific work.
> + * enabled after kentry_enter_from_usermode or a similar function.

Please write functions with () at the end. Also, what the heck does
'similar function' mean here? I really spent quite some time documenting
this stuff and it wants to stay that way.

>   *
> + * Called with IRQs on.  Returns with IRQs still on.

interrupts enabled please. This is technical documentation and not twatter.

> +void kentry_syscall_end(struct pt_regs *regs);

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread
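
Folding both comments in, the kernel-doc would read roughly as follows
(a sketch; the exact set of functions allowed to precede
kentry_syscall_begin() is an assumption based on the quoted hunk):

  /**
   * kentry_syscall_begin - Prepare to invoke a syscall handler
   * @regs:	Pointer to current's pt_regs
   * @syscall:	The syscall number
   *
   * Invoked from architecture specific syscall entry code with
   * interrupts enabled after kentry_enter_from_user_mode() and any
   * additional architecture specific work.
   *
   * Called with interrupts enabled. Returns with interrupts still
   * enabled.
   */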

* Re: [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode()
  2021-03-17 18:12 ` [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode() Andy Lutomirski
@ 2021-03-19 18:08   ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 18:08 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:
> -/**
> - * exit_to_user_mode - Fixup state when exiting to user mode
> - *
> - * Syscall/interrupt exit enables interrupts, but the kernel state is
> - * interrupts disabled when this is invoked. Also tell RCU about it.
> - *
> - * 1) Trace interrupts on state
> - * 2) Invoke context tracking if enabled to adjust RCU state
> - * 3) Invoke architecture specific last minute exit code, e.g. speculation
> - *    mitigations, etc.: arch_exit_to_user_mode()
> - * 4) Tell lockdep that interrupts are enabled
> - *
> - * Invoked from architecture specific code when syscall_exit_to_user_mode()
> - * is not suitable as the last step before returning to userspace. Must be
> - * invoked with interrupts disabled and the caller must be
> - * non-instrumentable.
> - * The caller has to invoke syscall_exit_to_user_mode_work() before this.
> - */
> -void exit_to_user_mode(void);

Oh well. There is this in the C code:

>  /* See comment for enter_from_user_mode() in entry-common.h */
>  static __always_inline void __exit_from_user_mode(void)

>  /* See comment for exit_to_user_mode() in entry-common.h */
>  static __always_inline void __exit_to_user_mode(void)

So the comments are now stale and the documentation is lost.

Making the diffstat impressive is not the primary goal here.

And looking at the rest of the comments, there are still references to
stuff you deleted. This is really not how it works.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 18+ messages in thread
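
One way to keep the documentation while still removing the public
wrapper would be to move the deleted kernel-doc onto the surviving
internal helper instead of dropping it, e.g. (a sketch reusing the text
quoted above; the helper body is unchanged and elided):

  /**
   * __exit_to_user_mode - Fixup state when exiting to user mode
   *
   * Syscall/interrupt exit enables interrupts, but the kernel state is
   * interrupts disabled when this is invoked. Also tell RCU about it.
   *
   * 1) Trace interrupts on state
   * 2) Invoke context tracking if enabled to adjust RCU state
   * 3) Invoke architecture specific last minute exit code, e.g.
   *    speculation mitigations, etc.: arch_exit_to_user_mode()
   * 4) Tell lockdep that interrupts are enabled
   */
  static __always_inline void __exit_to_user_mode(void)
  {
          /* ... body as in the patch ... */
  }

That also keeps the "See comment for exit_to_user_mode() in
entry-common.h" pointers from dangling, since the comment then lives
next to the code itself.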

* Re: [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage
  2021-03-19 16:17   ` Thomas Gleixner
@ 2021-03-19 18:16     ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 18:16 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Mark Rutland, Brian Gerst, Andy Lutomirski

On Fri, Mar 19 2021 at 17:17, Thomas Gleixner wrote:

> On Wed, Mar 17 2021 at 11:12, Andy Lutomirski wrote:
>> +
>> +#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
>
> So we have a stub for !DEBUG
>
>> +static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
>> +static void kentry_cpu_depth_check(unsigned int n) {}
>> +static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
>> +
>> +#endif
>> +
>>  /* See comment for enter_from_user_mode() in entry-common.h */
>>  static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>>  {
>> +	kentry_cpu_depth_add(1);
>>  	arch_check_user_regs(regs);
>>  	lockdep_hardirqs_off(CALLER_ADDR0);
>>  
>> @@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>>  
>>  	instrumentation_begin();
>>  	trace_hardirqs_off_finish();
>> +
>> +#ifdef CONFIG_DEBUG_ENTRY
>
> Why do we need that #ifdeffery all over the place?
>
>> +	DEBUG_ENTRY_WARN_ONCE(
>> +		this_cpu_read(kentry_cpu_depth) != 1,
>> +		"kentry: __enter_from_user_mode() called while kentry thought the CPU was in the kernel (%u)",
>> +		this_cpu_read(kentry_cpu_depth));

Because you directly access kentry_cpu_depth, which makes the compiler
unhappy.

And of course, at the other place, where you guard it with IS_ENABLED(),
it fails to build with CONFIG_DEBUG_ENTRY=n:

kernel/entry/common.c:158:10: error: ‘struct task_struct’ has no member named ‘kentry_in_syscall’
   current->kentry_in_syscall = true;

This is V4 of this series... Oh well.

     tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread
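
The quoted build failure illustrates a general pitfall: IS_ENABLED()
only produces a dead branch, and dead branches are still parsed and
type-checked, so they must not reference a struct member that exists
only under the config option. The usual way out is a pair of helpers
(names made up for illustration):

  #ifdef CONFIG_DEBUG_ENTRY
  static inline void kentry_set_in_syscall(bool in_syscall)
  {
          current->kentry_in_syscall = in_syscall;
  }
  #else
  static inline void kentry_set_in_syscall(bool in_syscall) { }
  #endif

A call site written as kentry_set_in_syscall(true) then compiles in both
configurations, whereas

  if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
          current->kentry_in_syscall = true;

fails with CONFIG_DEBUG_ENTRY=n exactly as in the error above.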

end of thread

Thread overview: 18+ messages
2021-03-17 18:12 [PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
2021-03-17 18:12 ` [PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
2021-03-17 18:12 ` [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
2021-03-18  0:36   ` Josh Poimboeuf
2021-03-17 18:12 ` [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C Andy Lutomirski
2021-03-18  0:38   ` Josh Poimboeuf
2021-03-19 16:05   ` Thomas Gleixner
2021-03-17 18:12 ` [PATCH v4 4/9] kentry: Simplify the common syscall API Andy Lutomirski
2021-03-19 16:09   ` Thomas Gleixner
2021-03-19 18:07   ` Thomas Gleixner
2021-03-17 18:12 ` [PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode() Andy Lutomirski
2021-03-19 18:08   ` Thomas Gleixner
2021-03-17 18:12 ` [PATCH v4 6/9] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
2021-03-17 18:12 ` [PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
2021-03-19 16:17   ` Thomas Gleixner
2021-03-19 18:16     ` Thomas Gleixner
2021-03-17 18:12 ` [PATCH v4 8/9] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
2021-03-17 18:12 ` [PATCH v4 9/9] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
