* [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements
@ 2021-03-04 19:05 Andy Lutomirski
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

I noticed a little bug in fast compat syscalls.  I got a bit carried away
fixing it.  This renames the irqentry stuff to kentry, improves (IMNSHO)
the API, and adds lots of debugging.

It also tweaks the unwinder wrt ret_from_fork and rewrites ret_from_fork
in C.  I did this because the kentry work involved a small change to
ret_from_fork, and adjusting the asm is a mess.  So C it is.

Changes from v1 and v2: Complete rewrite

Andy Lutomirski (11):
  x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  kentry: Rename irqentry to kentry
  x86/dumpstack: Remove unnecessary range check fetching opcode bytes
  x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  x86/entry: Convert ret_from_fork to C
  kentry: Simplify the common syscall API
  kentry: Make entry/exit_to_user_mode() arm64-only
  entry: Make CONFIG_DEBUG_ENTRY available outside x86
  kentry: Add debugging checks for proper kentry API usage
  kentry: Check that syscall entries and syscall exits match
  kentry: Verify kentry state in instrumentation_begin/end()

 arch/x86/Kconfig.debug           |  10 --
 arch/x86/entry/common.c          |  85 ++++++++++----
 arch/x86/entry/entry_32.S        |  51 ++------
 arch/x86/entry/entry_64.S        |  33 ++----
 arch/x86/include/asm/idtentry.h  |  28 ++---
 arch/x86/include/asm/switch_to.h |   2 +-
 arch/x86/kernel/cpu/mce/core.c   |  10 +-
 arch/x86/kernel/dumpstack.c      |  10 +-
 arch/x86/kernel/kvm.c            |   6 +-
 arch/x86/kernel/nmi.c            |   6 +-
 arch/x86/kernel/process.c        |  15 ++-
 arch/x86/kernel/process_32.c     |   2 +-
 arch/x86/kernel/traps.c          |  32 ++---
 arch/x86/kernel/unwind_orc.c     |   2 +-
 arch/x86/mm/fault.c              |   6 +-
 include/asm-generic/bug.h        |   8 +-
 include/linux/entry-common.h     | 180 ++++++++--------------------
 include/linux/instrumentation.h  |  25 +++-
 include/linux/sched.h            |   4 +
 init/init_task.c                 |   8 ++
 kernel/entry/common.c            | 193 +++++++++++++++++++++----------
 lib/Kconfig.debug                |  11 ++
 22 files changed, 366 insertions(+), 361 deletions(-)

-- 
2.29.2



* [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-05 10:16   ` [tip: x86/urgent] " tip-bot2 for Andy Lutomirski
                     ` (2 more replies)
  2021-03-04 19:05 ` [PATCH v3 02/11] kentry: Rename irqentry to kentry Andy Lutomirski
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski, stable

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

    $ ./tools/testing/selftests/x86/syscall_arg_fault_32
    ...
    strace: pid 264258: entering, ptrace_syscall_info.op == 2
    ...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.
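
For reference, ptrace_syscall_info.op == 2 is PTRACE_SYSCALL_INFO_EXIT:
the tracer expected a syscall-entry stop but the kernel reported a
syscall-exit stop, because only the exit work ran.  A minimal sketch of
such a check from a tracer's point of view (the request, struct, and
constants are real <linux/ptrace.h> UAPI; the helper itself is
illustrative, not the selftest's actual code):

    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <linux/ptrace.h>

    /* Nonzero if the stopped tracee reports a syscall-exit stop. */
    static int stop_is_syscall_exit(pid_t pid)
    {
            struct ptrace_syscall_info info;

            if (ptrace(PTRACE_GET_SYSCALL_INFO, pid,
                       (void *)sizeof(info), &info) < 0)
                    return 0;
            return info.op == PTRACE_SYSCALL_INFO_EXIT;
    }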

Cc: stable@vger.kernel.org
Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0904f5676e4d..8fdb4cb27efe 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 		regs->ax = -EFAULT;
 
 		instrumentation_end();
-		syscall_exit_to_user_mode(regs);
+		local_irq_disable();
+		irqentry_exit_to_user_mode(regs);
 		return false;
 	}
 
-- 
2.29.2



* [PATCH v3 02/11] kentry: Rename irqentry to kentry
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-08  9:45   ` Mark Rutland
  2021-03-04 19:05 ` [PATCH v3 03/11] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

The common entry functions are mostly named irqentry, and this is
confusing.  They are used for syscalls, exceptions, NMIs and, yes, IRQs.
Call them kentry instead, since they are supposed to be usable for any
entry to the kernel.

This patch doesn't touch the .irqentry section -- someone can contemplate
changing that later.  That code predates the common entry code.
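
After the rename, an arch exception handler pairs the calls as in this
sketch (equivalent to the DEFINE_IDTENTRY expansion below; the handler
and body names are illustrative):

    noinstr void arch_handle_example_exception(struct pt_regs *regs)
    {
            kentry_state_t state = kentry_enter(regs);  /* was irqentry_enter() */

            instrumentation_begin();
            handle_example_exception(regs);             /* illustrative body */
            instrumentation_end();

            kentry_exit(regs, state);                   /* was irqentry_exit() */
    }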

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c         |  8 ++---
 arch/x86/include/asm/idtentry.h | 28 +++++++--------
 arch/x86/kernel/cpu/mce/core.c  | 10 +++---
 arch/x86/kernel/kvm.c           |  6 ++--
 arch/x86/kernel/nmi.c           |  6 ++--
 arch/x86/kernel/traps.c         | 28 +++++++--------
 arch/x86/mm/fault.c             |  6 ++--
 include/linux/entry-common.h    | 60 ++++++++++++++++-----------------
 kernel/entry/common.c           | 26 +++++++-------
 9 files changed, 89 insertions(+), 89 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8fdb4cb27efe..95776f16c1cb 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,9 +264,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs;
 	bool inhcall;
-	irqentry_state_t state;
+	kentry_state_t state;
 
-	state = irqentry_enter(regs);
+	state = kentry_enter(regs);
 	old_regs = set_irq_regs(regs);
 
 	instrumentation_begin();
@@ -278,11 +278,11 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
 		instrumentation_begin();
-		irqentry_exit_cond_resched();
+		kentry_exit_cond_resched();
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
-		irqentry_exit(regs, state);
+		kentry_exit(regs, state);
 	}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index f656aabd1545..3ac805d24289 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -40,8 +40,8 @@
  * The macro is written so it acts as function definition. Append the
  * body with a pair of curly brackets.
  *
- * irqentry_enter() contains common code which has to be invoked before
- * arbitrary code in the body. irqentry_exit() contains common code
+ * kentry_enter() contains common code which has to be invoked before
+ * arbitrary code in the body. kentry_exit() contains common code
  * which has to run before returning to the low level assembly code.
  */
 #define DEFINE_IDTENTRY(func)						\
@@ -49,12 +49,12 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	kentry_state_t state = kentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	kentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +96,12 @@ static __always_inline void __##func(struct pt_regs *regs,		\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	kentry_state_t state = kentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs, error_code);					\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	kentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs,		\
@@ -156,7 +156,7 @@ __visible noinstr void func(struct pt_regs *regs)
  * body with a pair of curly brackets.
  *
  * Contrary to DEFINE_IDTENTRY_ERRORCODE() this does not invoke the
- * irqentry_enter/exit() helpers before and after the body invocation. This
+ * kentry_enter/exit() helpers before and after the body invocation. This
  * needs to be done in the body itself if applicable. Use if extra work
  * is required before the enter/exit() helpers are invoked.
  */
@@ -192,7 +192,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	kentry_state_t state = kentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -200,7 +200,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	kentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -224,7 +224,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector)
  * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
  * @func:	Function name of the entry point
  *
- * irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
+ * kentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
  * function body. KVM L1D flush request is set.
  *
  * Runs the function on the interrupt stack if the entry hit kernel mode
@@ -234,7 +234,7 @@ static void __##func(struct pt_regs *regs);				\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	kentry_state_t state = kentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -242,7 +242,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	run_sysvec_on_irqstack_cond(__##func, regs);			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	kentry_exit(regs, state);					\
 }									\
 									\
 static noinline void __##func(struct pt_regs *regs)
@@ -263,7 +263,7 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	kentry_state_t state = kentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
@@ -271,7 +271,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	__##func (regs);						\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	kentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 47a0eb57725c..3d4bfd54934e 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1974,7 +1974,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-	irqentry_state_t irq_state;
+	kentry_state_t irq_state;
 
 	WARN_ON_ONCE(user_mode(regs));
 
@@ -1986,7 +1986,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 	    mce_check_crashing_cpu())
 		return;
 
-	irq_state = irqentry_nmi_enter(regs);
+	irq_state = kentry_nmi_enter(regs);
 	/*
 	 * The call targets are marked noinstr, but objtool can't figure
 	 * that out because it's an indirect call. Annotate it.
@@ -1996,18 +1996,18 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 	machine_check_vector(regs);
 
 	instrumentation_end();
-	irqentry_nmi_exit(regs, irq_state);
+	kentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
 {
-	irqentry_enter_from_user_mode(regs);
+	kentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	machine_check_vector(regs);
 
 	instrumentation_end();
-	irqentry_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..bcecc3e7cd0a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
 noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
 	u32 flags = kvm_read_and_reset_apf_flags();
-	irqentry_state_t state;
+	kentry_state_t state;
 
 	if (!flags)
 		return false;
 
-	state = irqentry_enter(regs);
+	state = kentry_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 	}
 
 	instrumentation_end();
-	irqentry_exit(regs, state);
+	kentry_exit(regs, state);
 	return true;
 }
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..6f46722a6e94 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-	irqentry_state_t irq_state;
+	kentry_state_t irq_state;
 
 	/*
 	 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 	this_cpu_write(nmi_dr7, local_db_save());
 
-	irq_state = irqentry_nmi_enter(regs);
+	irq_state = kentry_nmi_enter(regs);
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	irqentry_nmi_exit(regs, irq_state);
+	kentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(this_cpu_read(nmi_dr7));
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..be924180005a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -249,7 +249,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 
 DEFINE_IDTENTRY_RAW(exc_invalid_op)
 {
-	irqentry_state_t state;
+	kentry_state_t state;
 
 	/*
 	 * We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -259,11 +259,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	if (!user_mode(regs) && handle_bug(regs))
 		return;
 
-	state = irqentry_enter(regs);
+	state = kentry_enter(regs);
 	instrumentation_begin();
 	handle_invalid_op(regs);
 	instrumentation_end();
-	irqentry_exit(regs, state);
+	kentry_exit(regs, state);
 }
 
 DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -410,7 +410,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	irqentry_nmi_enter(regs);
+	kentry_nmi_enter(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -646,26 +646,26 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		return;
 
 	/*
-	 * irqentry_enter_from_user_mode() uses static_branch_{,un}likely()
+	 * kentry_enter_from_user_mode() uses static_branch_{,un}likely()
 	 * and therefore can trigger INT3, hence poke_int3_handler() must
 	 * be done before. If the entry came from kernel mode, then use
 	 * nmi_enter() because the INT3 could have been hit in any context
 	 * including NMI.
 	 */
 	if (user_mode(regs)) {
-		irqentry_enter_from_user_mode(regs);
+		kentry_enter_from_user_mode(regs);
 		instrumentation_begin();
 		do_int3_user(regs);
 		instrumentation_end();
-		irqentry_exit_to_user_mode(regs);
+		kentry_exit_to_user_mode(regs);
 	} else {
-		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+		kentry_state_t irq_state = kentry_nmi_enter(regs);
 
 		instrumentation_begin();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		instrumentation_end();
-		irqentry_nmi_exit(regs, irq_state);
+		kentry_nmi_exit(regs, irq_state);
 	}
 }
 
@@ -860,7 +860,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	 * includes the entry stack is excluded for everything.
 	 */
 	unsigned long dr7 = local_db_save();
-	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+	kentry_state_t irq_state = kentry_nmi_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -917,7 +917,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 		regs->flags &= ~X86_EFLAGS_TF;
 out:
 	instrumentation_end();
-	irqentry_nmi_exit(regs, irq_state);
+	kentry_nmi_exit(regs, irq_state);
 
 	local_db_restore(dr7);
 }
@@ -935,14 +935,14 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
 
 	/*
 	 * NB: We can't easily clear DR7 here because
-	 * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
+	 * kentry_exit_to_usermode() can invoke ptrace, schedule, access
 	 * user memory, etc.  This means that a recursive #DB is possible.  If
 	 * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
 	 * Since we're not on the IST stack right now, everything will be
 	 * fine.
 	 */
 
-	irqentry_enter_from_user_mode(regs);
+	kentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	/*
@@ -988,7 +988,7 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
 	local_irq_disable();
 out:
 	instrumentation_end();
-	irqentry_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 106b22d1d189..fea6dee3f6f2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1432,7 +1432,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
 	unsigned long address = read_cr2();
-	irqentry_state_t state;
+	kentry_state_t state;
 
 	prefetchw(&current->mm->mmap_lock);
 
@@ -1470,11 +1470,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	 * code reenabled RCU to avoid subsequent wreckage which helps
 	 * debugability.
 	 */
-	state = irqentry_enter(regs);
+	state = kentry_enter(regs);
 
 	instrumentation_begin();
 	handle_page_fault(regs, error_code, address);
 	instrumentation_end();
 
-	irqentry_exit(regs, state);
+	kentry_exit(regs, state);
 }
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index ca86a00abe86..fd2d7c35670a 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -363,7 +363,7 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
 void syscall_exit_to_user_mode(struct pt_regs *regs);
 
 /**
- * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
+ * kentry_enter_from_user_mode - Establish state before invoking the irq handler
  * @regs:	Pointer to currents pt_regs
  *
  * Invoked from architecture specific entry code with interrupts disabled.
@@ -373,10 +373,10 @@ void syscall_exit_to_user_mode(struct pt_regs *regs);
  *
  * The function establishes state (lockdep, RCU (context tracking), tracing)
  */
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+void kentry_enter_from_user_mode(struct pt_regs *regs);
 
 /**
- * irqentry_exit_to_user_mode - Interrupt exit work
+ * kentry_exit_to_user_mode - Interrupt exit work
  * @regs:	Pointer to current's pt_regs
  *
  * Invoked with interrupts disbled and fully valid regs. Returns with all
@@ -388,34 +388,34 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
  * Interrupt exit is not invoking #1 which is the syscall specific one time
  * work.
  */
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+void kentry_exit_to_user_mode(struct pt_regs *regs);
 
-#ifndef irqentry_state
+#ifndef kentry_state
 /**
- * struct irqentry_state - Opaque object for exception state storage
- * @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the
+ * struct kentry_state - Opaque object for exception state storage
+ * @exit_rcu: Used exclusively in the kentry_*() calls; signals whether the
  *            exit path has to invoke rcu_irq_exit().
- * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that
+ * @lockdep: Used exclusively in the kentry_nmi_*() calls; ensures that
  *           lockdep state is restored correctly on exit from nmi.
  *
- * This opaque object is filled in by the irqentry_*_enter() functions and
- * must be passed back into the corresponding irqentry_*_exit() functions
+ * This opaque object is filled in by the kentry_*_enter() functions and
+ * must be passed back into the corresponding kentry_*_exit() functions
  * when the exception is complete.
  *
- * Callers of irqentry_*_[enter|exit]() must consider this structure opaque
+ * Callers of kentry_*_[enter|exit]() must consider this structure opaque
  * and all members private.  Descriptions of the members are provided to aid in
- * the maintenance of the irqentry_*() functions.
+ * the maintenance of the kentry_*() functions.
  */
-typedef struct irqentry_state {
+typedef struct kentry_state {
 	union {
 		bool	exit_rcu;
 		bool	lockdep;
 	};
-} irqentry_state_t;
+} kentry_state_t;
 #endif
 
 /**
- * irqentry_enter - Handle state tracking on ordinary interrupt entries
+ * kentry_enter - Handle state tracking on ordinary interrupt entries
  * @regs:	Pointer to pt_regs of interrupted context
  *
  * Invokes:
@@ -439,25 +439,25 @@ typedef struct irqentry_state {
  * solves the problem of kernel mode pagefaults which can schedule, which
  * is not possible after invoking rcu_irq_enter() without undoing it.
  *
- * For user mode entries irqentry_enter_from_user_mode() is invoked to
+ * For user mode entries kentry_enter_from_user_mode() is invoked to
  * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
  * would not be possible.
  *
  * Returns: An opaque object that must be passed to idtentry_exit()
  */
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+kentry_state_t noinstr kentry_enter(struct pt_regs *regs);
 
 /**
- * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ * kentry_exit_cond_resched - Conditionally reschedule on return from interrupt
  *
  * Conditional reschedule with additional sanity checks.
  */
-void irqentry_exit_cond_resched(void);
+void kentry_exit_cond_resched(void);
 
 /**
- * irqentry_exit - Handle return from exception that used irqentry_enter()
+ * kentry_exit - Handle return from exception that used kentry_enter()
  * @regs:	Pointer to pt_regs (exception entry regs)
- * @state:	Return value from matching call to irqentry_enter()
+ * @state:	Return value from matching call to kentry_enter()
  *
  * Depending on the return target (kernel/user) this runs the necessary
  * preemption and work checks if possible and required and returns to
@@ -466,27 +466,27 @@ void irqentry_exit_cond_resched(void);
  * This is the last action before returning to the low level ASM code which
  * just needs to return to the appropriate context.
  *
- * Counterpart to irqentry_enter().
+ * Counterpart to kentry_enter().
  */
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr kentry_exit(struct pt_regs *regs, kentry_state_t state);
 
 /**
- * irqentry_nmi_enter - Handle NMI entry
+ * kentry_nmi_enter - Handle NMI entry
  * @regs:	Pointer to currents pt_regs
  *
- * Similar to irqentry_enter() but taking care of the NMI constraints.
+ * Similar to kentry_enter() but taking care of the NMI constraints.
  */
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs);
 
 /**
- * irqentry_nmi_exit - Handle return from NMI handling
+ * kentry_nmi_exit - Handle return from NMI handling
  * @regs:	Pointer to pt_regs (NMI entry regs)
- * @irq_state:	Return value from matching call to irqentry_nmi_enter()
+ * @irq_state:	Return value from matching call to kentry_nmi_enter()
  *
  * Last action before returning to the low level assembly code.
  *
- * Counterpart to irqentry_nmi_enter().
+ * Counterpart to kentry_nmi_enter().
  */
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state);
 
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..269766a8f981 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -304,12 +304,12 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 	__exit_to_user_mode();
 }
 
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
+noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	__enter_from_user_mode(regs);
 }
 
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
+noinstr void kentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
 	exit_to_user_mode_prepare(regs);
@@ -317,14 +317,14 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	__exit_to_user_mode();
 }
 
-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr kentry_state_t kentry_enter(struct pt_regs *regs)
 {
-	irqentry_state_t ret = {
+	kentry_state_t ret = {
 		.exit_rcu = false,
 	};
 
 	if (user_mode(regs)) {
-		irqentry_enter_from_user_mode(regs);
+		kentry_enter_from_user_mode(regs);
 		return ret;
 	}
 
@@ -355,7 +355,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
-		 * as in irqentry_enter_from_user_mode().
+		 * as in kentry_enter_from_user_mode().
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
@@ -382,7 +382,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	return ret;
 }
 
-void irqentry_exit_cond_resched(void)
+void kentry_exit_cond_resched(void)
 {
 	if (!preempt_count()) {
 		/* Sanity check RCU and thread stack */
@@ -394,13 +394,13 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 
-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void kentry_exit(struct pt_regs *regs, kentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
 
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
-		irqentry_exit_to_user_mode(regs);
+		kentry_exit_to_user_mode(regs);
 	} else if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
@@ -420,7 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 
 		instrumentation_begin();
 		if (IS_ENABLED(CONFIG_PREEMPTION))
-			irqentry_exit_cond_resched();
+			kentry_exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
@@ -434,9 +434,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 	}
 }
 
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
 {
-	irqentry_state_t irq_state;
+	kentry_state_t irq_state;
 
 	irq_state.lockdep = lockdep_hardirqs_enabled();
 
@@ -453,7 +453,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	return irq_state;
 }
 
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
 {
 	instrumentation_begin();
 	ftrace_nmi_exit();
-- 
2.29.2



* [PATCH v3 03/11] x86/dumpstack: Remove unnecessary range check fetching opcode bytes
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
  2021-03-04 19:05 ` [PATCH v3 02/11] kentry: Rename irqentry to kentry Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-04 19:05 ` [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

copy_from_user_nmi() validates that the pointer is in the user range,
so there is no need for an extra check in copy_code().
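
For context, the check that makes the removed code redundant sits at the
top of copy_from_user_nmi() itself.  A sketch of its shape (paraphrased
from arch/x86/lib/usercopy.c; details may differ):

    unsigned long
    copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
    {
            /* Reject any range that is not entirely in user space. */
            if (__range_not_ok(from, n, TASK_SIZE))
                    return n;       /* nothing copied */

            /* ... NMI-safe, pagefault-disabled copy follows ... */
    }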

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 299c20f0a38b..55cf3c8325c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -81,12 +81,6 @@ static int copy_code(struct pt_regs *regs, u8 *buf, unsigned long src,
 	/* The user space code from other tasks cannot be accessed. */
 	if (regs != task_pt_regs(current))
 		return -EPERM;
-	/*
-	 * Make sure userspace isn't trying to trick us into dumping kernel
-	 * memory by pointing the userspace instruction pointer at it.
-	 */
-	if (__chk_range_not_ok(src, nbytes, TASK_SIZE_MAX))
-		return -EINVAL;
 
 	/*
 	 * Even if named copy_from_user_nmi() this can be invoked from
-- 
2.29.2



* [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (2 preceding siblings ...)
  2021-03-04 19:05 ` [PATCH v3 03/11] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-04 20:19   ` Ira Weiny
  2021-03-04 19:05 ` [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C Andy Lutomirski
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski, Josh Poimboeuf

For kernel threads, task_pt_regs is currently all zeros, a valid user state
(if kernel_execve() has been called), or some combination thereof during
execution of kernel_execve().  If a stack trace is printed, the unwinder
might get confused and treat task_pt_regs as a kernel state.  Indeed,
forcing a stack dump results in an attempt to display _kernel_ code bytes
from a bogus address at the very top of kernel memory.

Fix the confusion by setting cs=3 so that user_mode(task_pt_regs(...)) ==
true for kernel threads.

Also improve the error when failing to fetch code bytes to show which type
of fetch failed.  This makes it much easier to understand what's happening.
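
This works because user_mode() inspects only the low (RPL) bits of CS.  A
sketch of the relevant check (paraphrased from
arch/x86/include/asm/ptrace.h):

    static __always_inline int user_mode(struct pt_regs *regs)
    {
    #ifdef CONFIG_X86_32
            return ((regs->cs & SEGMENT_RPL_MASK) |
                    (regs->flags & X86_VM_MASK)) >= USER_RPL;
    #else
            return !!(regs->cs & 3);        /* cs == 3 => RPL 3 => user */
    #endif
    }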

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c |  4 ++--
 arch/x86/kernel/process.c   | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 55cf3c8325c6..9b7b69bb12e5 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -128,8 +128,8 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
 		/* No access to the user space stack of other tasks. Ignore. */
 		break;
 	default:
-		printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
-		       loglvl, prologue);
+		printk("%sCode: Unable to access %s opcode bytes at RIP 0x%lx.\n",
+		       loglvl, user_mode(regs) ? "user" : "kernel", prologue);
 		break;
 	}
 }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 145a7ac0c19a..f6f16df04cb9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		memset(childregs, 0, sizeof(struct pt_regs));
+
+		/*
+		 * Even though there is no real user state here, these
+		 * are where user regs belong, and kernel_execve() will
+		 * overwrite them with user regs.  Put an obviously
+		 * invalid value in CS that nonetheless causes
+		 * user_mode(regs) to return true.
+		 *
+		 * This also prevents the unwinder from thinking that there
+		 * is invalid kernel state at the top of the stack.
+		 */
+		childregs->cs = 3;
+
 		kthread_frame_init(frame, sp, arg);
 		return 0;
 	}
-- 
2.29.2



* [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (3 preceding siblings ...)
  2021-03-04 19:05 ` [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-05  0:55   ` Brian Gerst
  2021-03-04 19:05 ` [PATCH v3 06/11] kentry: Simplify the common syscall API Andy Lutomirski
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski, Josh Poimboeuf

ret_from_fork is written in asm and differs slightly between x86_32 and
x86_64.  Convert it to C.

This is a straight conversion without any particular cleverness.  As a
further cleanup, the code that sets up the ret_from_fork argument registers
could be adjusted to place the arguments directly in the registers that the
C calling convention expects.

This will cause the ORC unwinder to find pt_regs even for kernel threads on
x86_64.  This seems harmless.

The 32-bit comment above the now-deleted schedule_tail_wrapper was
obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
same problem more cleanly.

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c          | 23 ++++++++++++++
 arch/x86/entry/entry_32.S        | 51 +++++---------------------------
 arch/x86/entry/entry_64.S        | 33 +++++----------------
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/kernel/process.c        |  2 +-
 arch/x86/kernel/process_32.c     |  2 +-
 arch/x86/kernel/unwind_orc.c     |  2 +-
 7 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 95776f16c1cb..ef1c65938a6b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+void ret_from_fork(struct task_struct *prev,
+		   int (*kernel_thread_fn)(void *),
+		   void *kernel_thread_arg,
+		   struct pt_regs *user_regs);
+
+__visible void noinstr ret_from_fork(struct task_struct *prev,
+				     int (*kernel_thread_fn)(void *),
+				     void *kernel_thread_arg,
+				     struct pt_regs *user_regs)
+{
+	instrumentation_begin();
+
+	schedule_tail(prev);
+
+	if (kernel_thread_fn) {
+		kernel_thread_fn(kernel_thread_arg);
+		user_regs->ax = 0;
+	}
+
+	instrumentation_end();
+	syscall_exit_to_user_mode(user_regs);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index df8c017e6161..7113d259727f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -805,26 +805,6 @@ SYM_CODE_START(__switch_to_asm)
 SYM_CODE_END(__switch_to_asm)
 .popsection
 
-/*
- * The unwinder expects the last frame on the stack to always be at the same
- * offset from the end of the page, which allows it to validate the stack.
- * Calling schedule_tail() directly would break that convention because its an
- * asmlinkage function so its argument has to be pushed on the stack.  This
- * wrapper creates a proper "end of stack" frame header before the call.
- */
-.pushsection .text, "ax"
-SYM_FUNC_START(schedule_tail_wrapper)
-	FRAME_BEGIN
-
-	pushl	%eax
-	call	schedule_tail
-	popl	%eax
-
-	FRAME_END
-	ret
-SYM_FUNC_END(schedule_tail_wrapper)
-.popsection
-
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -833,29 +813,14 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * edi: kernel thread arg
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-	call	schedule_tail_wrapper
-
-	testl	%ebx, %ebx
-	jnz	1f		/* kernel threads are uncommon */
-
-2:
-	/* When we fork, we trace the syscall return in the child, too. */
-	movl    %esp, %eax
-	call    syscall_exit_to_user_mode
-	jmp     .Lsyscall_32_done
-
-	/* kernel thread */
-1:	movl	%edi, %eax
-	CALL_NOSPEC ebx
-	/*
-	 * A kernel thread is allowed to return here after successfully
-	 * calling kernel_execve().  Exit to userspace to complete the execve()
-	 * syscall.
-	 */
-	movl	$0, PT_EAX(%esp)
-	jmp	2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_START(asm_ret_from_fork)
+	movl	%ebx, %edx
+	movl	%edi, %ecx
+	pushl	%esp
+	call	ret_from_fork
+	addl	$4, %esp
+	jmp	.Lsyscall_32_done
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..0f7df8861ac1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -273,35 +273,18 @@ SYM_FUNC_END(__switch_to_asm)
  * rax: prev task we switched from
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
+ * rbp: encoded frame pointer for the fp unwinder
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-	UNWIND_HINT_EMPTY
-	movq	%rax, %rdi
-	call	schedule_tail			/* rdi: 'prev' task parameter */
-
-	testq	%rbx, %rbx			/* from kernel_thread? */
-	jnz	1f				/* kernel threads are uncommon */
-
-2:
+SYM_CODE_START(asm_ret_from_fork)
 	UNWIND_HINT_REGS
-	movq	%rsp, %rdi
-	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
+	movq	%rax, %rdi
+	movq	%rbx, %rsi
+	movq	%r12, %rdx
+	movq	%rsp, %rcx
+	call	ret_from_fork
 	jmp	swapgs_restore_regs_and_return_to_usermode
-
-1:
-	/* kernel thread */
-	UNWIND_HINT_EMPTY
-	movq	%r12, %rdi
-	CALL_NOSPEC rbx
-	/*
-	 * A kernel thread is allowed to return here after successfully
-	 * calling kernel_execve().  Exit to userspace to complete the execve()
-	 * syscall.
-	 */
-	movq	$0, RAX(%rsp)
-	jmp	2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 .macro DEBUG_ENTRY_ASSERT_IRQS_OFF
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 9f69cc497f4b..fcb9b02a1269 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -12,7 +12,7 @@ struct task_struct *__switch_to_asm(struct task_struct *prev,
 __visible struct task_struct *__switch_to(struct task_struct *prev,
 					  struct task_struct *next);
 
-asmlinkage void ret_from_fork(void);
+asmlinkage void asm_ret_from_fork(void);
 
 /*
  * This is the structure pointed to by thread.sp for an inactive task.  The
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f6f16df04cb9..34efbca08738 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -135,7 +135,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	frame = &fork_frame->frame;
 
 	frame->bp = encode_frame_pointer(childregs);
-	frame->ret_addr = (unsigned long) ret_from_fork;
+	frame->ret_addr = (unsigned long) asm_ret_from_fork;
 	p->thread.sp = (unsigned long) fork_frame;
 	p->thread.io_bitmap = NULL;
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..bf8aa15ac652 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -151,7 +151,7 @@ EXPORT_SYMBOL_GPL(start_thread);
  * more flexibility.
  *
  * The return value (in %ax) will be the "prev" task after
- * the task-switch, and shows up in ret_from_fork in entry.S,
+ * the task-switch, and shows up in asm_ret_from_fork in entry_32.S,
  * for example.
  */
 __visible __notrace_funcgraph struct task_struct *
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 73f800100066..c6e7235c6d9f 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -659,7 +659,7 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		state->sp = task->thread.sp + sizeof(*frame);
 		state->bp = READ_ONCE_NOCHECK(frame->bp);
 		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
-		state->signal = (void *)state->ip == ret_from_fork;
+		state->signal = (void *)state->ip == asm_ret_from_fork;
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-- 
2.29.2



* [PATCH v3 06/11] kentry: Simplify the common syscall API
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (4 preceding siblings ...)
  2021-03-04 19:05 ` [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C Andy Lutomirski
@ 2021-03-04 19:05 ` Andy Lutomirski
  2021-03-08  9:49   ` Mark Rutland
  2021-03-04 19:06 ` [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only Andy Lutomirski
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:05 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

The new common syscall API had a large and confusing surface.  Simplify
it.  Now there is exactly one way to use it.  It's a bit more verbose than
the old way for the simple x86_64 native case, but it's much easier to use
right, and the diffstat should speak for itself.
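
The single supported sequence, as used by do_syscall_64() in this patch
(sketch; syscall dispatch elided):

    kentry_enter_from_user_mode(regs);
    local_irq_enable();
    instrumentation_begin();

    nr = kentry_syscall_begin(regs, nr);
    /* ... dispatch to sys_call_table[nr] ... */
    kentry_syscall_end(regs);

    local_irq_disable();
    instrumentation_end();
    kentry_exit_to_user_mode(regs);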

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c      | 57 +++++++++++++++---------
 include/linux/entry-common.h | 86 ++++++------------------------------
 kernel/entry/common.c        | 57 +++---------------------
 3 files changed, 54 insertions(+), 146 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ef1c65938a6b..8710b2300b8d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,9 +38,12 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-	nr = syscall_enter_from_user_mode(regs, nr);
-
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
 	instrumentation_begin();
+
+	nr = kentry_syscall_begin(regs, nr);
+
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
 		regs->ax = sys_call_table[nr](regs);
@@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
+
+	kentry_syscall_end(regs);
+
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 #endif
 
@@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
 
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
+	instrumentation_begin();
+
 	/*
 	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 	 * orig_ax, the unsigned int return value truncates it.  This may
 	 * or may not be necessary, but it matches the old asm behavior.
 	 */
-	nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
-	instrumentation_begin();
-
+	nr = (unsigned int)kentry_syscall_begin(regs, nr);
 	do_syscall_32_irqs_on(regs, nr);
+	kentry_syscall_end(regs);
 
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
+	kentry_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
+	bool ret;
 	int res;
 
-	/*
-	 * This cannot use syscall_enter_from_user_mode() as it has to
-	 * fetch EBP before invoking any of the syscall entry work
-	 * functions.
-	 */
-	syscall_enter_from_user_mode_prepare(regs);
-
+	kentry_enter_from_user_mode(regs);
+	local_irq_enable();
 	instrumentation_begin();
+
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*
@@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-
-		instrumentation_end();
-		local_irq_disable();
-		irqentry_exit_to_user_mode(regs);
-		return false;
+		ret = false;
+		goto out;
 	}
 
 	/* The case truncates any ptrace induced syscall nr > 2^32 -1 */
-	nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+	nr = (unsigned int)kentry_syscall_begin(regs, nr);
 
 	/* Now this is just like a normal syscall. */
 	do_syscall_32_irqs_on(regs, nr);
 
+	kentry_syscall_end(regs);
+	ret = true;
+
+out:
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(regs);
-	return true;
+	kentry_exit_to_user_mode(regs);
+	return ret;
 }
 
@@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct *prev,
 		user_regs->ax = 0;
 	}
 
+	kentry_syscall_end(user_regs);
+
+	local_irq_disable();
 	instrumentation_end();
-	syscall_exit_to_user_mode(user_regs);
+	kentry_exit_to_user_mode(user_regs);
 }
 
 #ifdef CONFIG_XEN_PV
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fd2d7c35670a..5287c6c15a66 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -119,31 +119,12 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 void enter_from_user_mode(struct pt_regs *regs);
 
 /**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs:	Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
-/**
- * syscall_enter_from_user_mode_work - Check and handle work before invoking
- *				       a syscall
+ * kentry_syscall_begin - Prepare to invoke a syscall handler
  * @regs:	Pointer to currents pt_regs
  * @syscall:	The syscall number
  *
  * Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after kentry_enter_from_user_mode() or a similar function.
  *
  * Returns: The original or a modified syscall number
  *
@@ -152,32 +133,16 @@ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
  * syscall_set_return_value() first.  If neither of those are called and -1
  * is returned, then the syscall will fail with ENOSYS.
  *
+ * After calling kentry_syscall_begin(), regardless of the return value,
+ * the caller must call kentry_syscall_end().
+ *
  * It handles the following work items:
  *
  *  1) syscall_work flag dependent invocations of
  *     arch_syscall_enter_tracehook(), __secure_computing(), trace_sys_enter()
  *  2) Invocation of audit_syscall_entry()
  */
-long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
-
-/**
- * syscall_enter_from_user_mode - Establish state and check and handle work
- *				  before invoking a syscall
- * @regs:	Pointer to currents pt_regs
- * @syscall:	The syscall number
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
- *
- * Returns: The original or a modified syscall number. See
- * syscall_enter_from_user_mode_work() for further explanation.
- */
-long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+long kentry_syscall_begin(struct pt_regs *regs, long syscall);
 
 /**
  * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
@@ -317,28 +282,16 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 void exit_to_user_mode(void);
 
 /**
- * syscall_exit_to_user_mode_work - Handle work before returning to user mode
+ * kentry_syscall_end - Finish syscall processing
  * @regs:	Pointer to currents pt_regs
  *
- * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling
- * exit_to_user_mode() to perform the final transition to user mode.
  *
- * Calling convention is the same as for syscall_exit_to_user_mode() and it
- * returns with all work handled and interrupts disabled. The caller must
- * invoke exit_to_user_mode() before actually switching to user mode to
- * make the final state transitions. Interrupts must stay disabled between
- * return from this function and the invocation of exit_to_user_mode().
- */
-void syscall_exit_to_user_mode_work(struct pt_regs *regs);
-
-/**
- * syscall_exit_to_user_mode - Handle work before returning to user mode
- * @regs:	Pointer to currents pt_regs
+ * This must be called after arch code calls kentry_syscall_begin()
+ * and invokes the syscall handler, if any.  This must also be called when
+ * returning from fork() to user mode, since return-from-fork is considered
+ * to be a syscall return.
  *
- * Invoked with interrupts enabled and fully valid regs. Returns with all
- * work handled, interrupts disabled such that the caller can immediately
- * switch to user mode. Called from architecture specific syscall and ret
- * from fork code.
+ * Called with IRQs on.  Returns with IRQs still on.
  *
  * The call order is:
  *  1) One-time syscall exit work:
@@ -346,21 +299,8 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
  *      - audit
  *	- syscall tracing
  *	- tracehook (single stepping)
- *
- *  2) Preparatory work
- *	- Exit to user mode loop (common TIF handling). Invokes
- *	  arch_exit_to_user_mode_work() for architecture specific TIF work
- *	- Architecture specific one time work arch_exit_to_user_mode_prepare()
- *	- Address limit and lockdep checks
- *
- *  3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
- *     functionality in exit_to_user_mode().
- *
- * This is a combination of syscall_exit_to_user_mode_work() (1,2) and
- * exit_to_user_mode(). This function is preferred unless there is a
- * compelling architectural reason to use the seperate functions.
  */
-void syscall_exit_to_user_mode(struct pt_regs *regs);
+void kentry_syscall_end(struct pt_regs *regs);
 
 /**
  * kentry_enter_from_user_mode - Establish state before invoking the irq handler
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 269766a8f981..800ad406431b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -80,44 +80,19 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 	return ret ? : syscall;
 }
 
-static __always_inline long
-__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
+long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
 	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+	lockdep_assert_irqs_enabled();
+
 	if (work & SYSCALL_WORK_ENTER)
 		syscall = syscall_trace_enter(regs, syscall, work);
 
 	return syscall;
 }
 
-long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
-{
-	return __syscall_enter_from_user_work(regs, syscall);
-}
-
-noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
-{
-	long ret;
-
-	__enter_from_user_mode(regs);
-
-	instrumentation_begin();
-	local_irq_enable();
-	ret = __syscall_enter_from_user_work(regs, syscall);
-	instrumentation_end();
-
-	return ret;
-}
-
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
-	__enter_from_user_mode(regs);
-	instrumentation_begin();
-	local_irq_enable();
-	instrumentation_end();
-}
-
 /* See comment for exit_to_user_mode() in entry-common.h */
 static __always_inline void __exit_to_user_mode(void)
 {
@@ -218,7 +193,7 @@ static inline bool report_single_step(unsigned long work)
 /*
  * If SYSCALL_EMU is set, then the only reason to report is when
  * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
- * instruction has been already reported in syscall_enter_from_user_mode().
+ * instruction has been already reported in kentry_syscall_begin().
  */
 static inline bool report_single_step(unsigned long work)
 {
@@ -261,7 +236,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
  * Syscall specific exit to user mode preparation. Runs with interrupts
  * enabled.
  */
-static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+void kentry_syscall_end(struct pt_regs *regs)
 {
 	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 	unsigned long nr = syscall_get_nr(current, regs);
@@ -284,26 +259,6 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 		syscall_exit_work(regs, work);
 }
 
-static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
-{
-	syscall_exit_to_user_mode_prepare(regs);
-	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
-}
-
-void syscall_exit_to_user_mode_work(struct pt_regs *regs)
-{
-	__syscall_exit_to_user_mode_work(regs);
-}
-
-__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__syscall_exit_to_user_mode_work(regs);
-	instrumentation_end();
-	__exit_to_user_mode();
-}
-
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	__enter_from_user_mode(regs);
-- 
2.29.2



* [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (5 preceding siblings ...)
  2021-03-04 19:05 ` [PATCH v3 06/11] kentry: Simplify the common syscall API Andy Lutomirski
@ 2021-03-04 19:06 ` Andy Lutomirski
  2021-03-08 10:06   ` Mark Rutland
  2021-03-04 19:06 ` [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:06 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
It's used only by arm64, and arm64 should stop using it (hint, hint!).
Compile it out on other architectures to minimize the chance of error.

enter_from_user_mode() is a legacy way to spell
kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
the same treatment.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/entry-common.h | 34 ++++++----------------------------
 kernel/entry/common.c        |  4 ++++
 2 files changed, 10 insertions(+), 28 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 5287c6c15a66..a75374f87258 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -97,26 +97,15 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
+#ifdef CONFIG_ARM64
 /**
  * enter_from_user_mode - Establish state when coming from user mode
  *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ * Legacy variant of kentry_enter_from_user_mode().  Used only by arm64.
  *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and interrupts are still
- * disabled. The subsequent functions can be instrumented.
- *
- * This is invoked when there is architecture specific functionality to be
- * done between establishing state and enabling interrupts. The caller must
- * enable interrupts before invoking syscall_enter_from_user_mode_work().
  */
 void enter_from_user_mode(struct pt_regs *regs);
+#endif
 
 /**
  * kentry_syscall_begin - Prepare to invoke a syscall handler
@@ -261,25 +250,14 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 }
 #endif
 
+#ifdef CONFIG_ARM64
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
- * Syscall/interrupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *    mitigations, etc.: arch_exit_to_user_mode()
- * 4) Tell lockdep that interrupts are enabled
- *
- * Invoked from architecture specific code when syscall_exit_to_user_mode()
- * is not suitable as the last step before returning to userspace. Must be
- * invoked with interrupts disabled and the caller must be
- * non-instrumentable.
- * The caller has to invoke syscall_exit_to_user_mode_work() before this.
+ * Does the latter part of kentry_exit_to_user_mode().  Only used by arm64.
  */
 void exit_to_user_mode(void);
+#endif
 
 /**
  * kentry_syscall_end - Finish syscall processing
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 800ad406431b..4ba82c684189 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -25,10 +25,12 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	instrumentation_end();
 }
 
+#ifdef CONFIG_ARM64
 void noinstr enter_from_user_mode(struct pt_regs *regs)
 {
 	__enter_from_user_mode(regs);
 }
+#endif
 
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
@@ -106,10 +108,12 @@ static __always_inline void __exit_to_user_mode(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+#ifdef CONFIG_ARM64
 void noinstr exit_to_user_mode(void)
 {
 	__exit_to_user_mode();
 }
+#endif
 
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (6 preceding siblings ...)
  2021-03-04 19:06 ` [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only Andy Lutomirski
@ 2021-03-04 19:06 ` Andy Lutomirski
  2021-03-08 10:13   ` Mark Rutland
  2021-03-04 19:06 ` [PATCH v3 09/11] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:06 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

In principle, the generic entry code is generic, and the goal is to use it
in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
to the generic config so that it can be used in the generic entry code and
not just in arch/x86.

Disable it on arm64.  arm64 uses some but not all of the kentry
code, and trying to debug the resulting state machine will be painful.
arm64 can turn it back on when it starts using the entire generic
path.
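
Once the option lives in lib/Kconfig.debug, generic code can gate its
checks on it in the usual way, e.g. (sketch, not part of this patch):

    /* illustrative check only */
    if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
            WARN_ON_ONCE(!irqs_disabled());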

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig.debug | 10 ----------
 lib/Kconfig.debug      | 11 +++++++++++
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7f4947..a5a52133730c 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -170,16 +170,6 @@ config CPA_DEBUG
 	help
 	  Do change_page_attr() self-tests every 30 seconds.
 
-config DEBUG_ENTRY
-	bool "Debug low-level entry code"
-	depends on DEBUG_KERNEL
-	help
-	  This option enables sanity checks in x86's low-level entry code.
-	  Some of these sanity checks may slow down kernel entries and
-	  exits or otherwise impact performance.
-
-	  If unsure, say N.
-
 config DEBUG_NMI_SELFTEST
 	bool "NMI Selftest"
 	depends on DEBUG_KERNEL && X86_LOCAL_APIC
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..76549c8afa8a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
 
 endmenu # lock debugging
 
+config DEBUG_ENTRY
+	bool "Debug low-level entry code"
+	depends on DEBUG_KERNEL
+	depends on !ARM64
+	help
+	  This option enables sanity checks in the low-level entry code.
+	  Some of these sanity checks may slow down kernel entries and
+	  exits or otherwise impact performance.
+
+	  If unsure, say N.
+
 config TRACE_IRQFLAGS
 	depends on TRACE_IRQFLAGS_SUPPORT
 	bool
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 09/11] kentry: Add debugging checks for proper kentry API usage
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (7 preceding siblings ...)
  2021-03-04 19:06 ` [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
@ 2021-03-04 19:06 ` Andy Lutomirski
  2021-03-04 19:06 ` [PATCH v3 10/11] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
  2021-03-04 19:06 ` [PATCH v3 11/11] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
  10 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:06 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

It's quite easy to mess up kentry calls.  Add debugging checks that kentry
transitions to and from user mode match up and that kentry_nmi_enter() and
kentry_nmi_exit() match up.

Checking full matching of kentry_enter() with kentry_exit() needs
per-task state.
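
To make the intended accounting concrete, these are the depth values the
checks below expect (illustration only):

    user mode                   depth == 0
    __enter_from_user_mode()    depth += 1     -> 1
    kentry_nmi_enter()          depth += 1024  -> 1025 (nests to 2049, ...)
    kentry_nmi_exit()           depth -= 1024  -> 1
    __exit_to_user_mode()       depth -= 1     -> 0

Any check that observes an unexpected value fires DEBUG_ENTRY_WARN_ONCE().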

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/entry/common.c | 81 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 4ba82c684189..f62934d761e3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,9 +11,65 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
+/*
+ * kentry_cpu_depth is 0 in user mode, 1 in normal kernel mode, and
+ * 1 + n * KENTRY_CPU_DEPTH_NMI in kentry_nmi_enter() mode.  We can't
+ * use a percpu variable to match up kentry_enter() from kernel mode
+ * with the corresponding kentry_exit() because tasks may schedule inside
+ * nested kentry_enter() regions.
+ */
+#define KENTRY_CPU_DEPTH_NMI	1024UL
+
+#ifdef CONFIG_DEBUG_ENTRY
+
+/*
+ * Extra safe WARN_ONCE.  Per-arch optimized WARN_ONCE() implementations
+ * might go through the low-level entry and kentry code even before noticing
+ * that the warning already fired, which could result in recursive warnings.
+ * This carefully avoids any funny business once a given warning has fired.
+ */
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) ({		\
+	static bool __section(".data.once") __warned;		\
+	int __ret_warn_once = !!(condition);			\
+								\
+	if (unlikely(__ret_warn_once && !READ_ONCE(__warned))) {\
+		WRITE_ONCE(__warned, true);			\
+		WARN(1, format);				\
+	}							\
+	unlikely(__ret_warn_once);				\
+})
+
+
+static DEFINE_PER_CPU(unsigned int, kentry_cpu_depth) = 1UL;
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n)
+{
+	this_cpu_add(kentry_cpu_depth, n);
+}
+
+static void kentry_cpu_depth_check(unsigned int n)
+{
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) < n, "kentry depth underflow");
+}
+
+static __always_inline void kentry_cpu_depth_sub(unsigned int n)
+{
+	this_cpu_sub(kentry_cpu_depth, n);
+}
+#else
+
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
+static void kentry_cpu_depth_check(unsigned int n) {}
+static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
+
+#endif
+
 /* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
+	kentry_cpu_depth_add(1);
 	arch_check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
@@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(
+		this_cpu_read(kentry_cpu_depth) != 1,
+		"kentry: __enter_from_user_mode() called while kentry thought the CPU was in the kernel (%u)",
+		this_cpu_read(kentry_cpu_depth));
+#endif
+
 	instrumentation_end();
 }
 
@@ -99,6 +163,11 @@ long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
+			      "__exit_to_user_mode called at wrong kentry cpu depth (%u)",
+			      this_cpu_read(kentry_cpu_depth));
+#endif
 	trace_hardirqs_on_prepare();
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	instrumentation_end();
@@ -106,6 +175,7 @@ static __always_inline void __exit_to_user_mode(void)
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
+	kentry_cpu_depth_sub(1);
 }
 
 #ifdef CONFIG_ARM64
@@ -360,7 +430,12 @@ noinstr void kentry_exit(struct pt_regs *regs, kentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		kentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
+		return;
+	}
+
+	kentry_cpu_depth_check(1);
+
+	if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -399,6 +474,8 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
 
 	irq_state.lockdep = lockdep_hardirqs_enabled();
 
+	kentry_cpu_depth_add(KENTRY_CPU_DEPTH_NMI);
+
 	__nmi_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	lockdep_hardirq_enter();
@@ -415,6 +492,7 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
 void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
 {
 	instrumentation_begin();
+	kentry_cpu_depth_check(KENTRY_CPU_DEPTH_NMI);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();
@@ -427,4 +505,5 @@ void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
 	if (irq_state.lockdep)
 		lockdep_hardirqs_on(CALLER_ADDR0);
 	__nmi_exit();
+	kentry_cpu_depth_sub(KENTRY_CPU_DEPTH_NMI);
 }
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 10/11] kentry: Check that syscall entries and syscall exits match
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (8 preceding siblings ...)
  2021-03-04 19:06 ` [PATCH v3 09/11] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
@ 2021-03-04 19:06 ` Andy Lutomirski
  2021-03-04 19:06 ` [PATCH v3 11/11] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
  10 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:06 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

If arch code calls the wrong kernel entry helpers, syscall entries and
exits can get out of sync.  Add a new field to task_struct to track the
syscall state and validate that it transitions correctly.
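
For reference, the transitions the checks below enforce (a summary of the
patch, not additional code):

    kentry_syscall_begin():  warn if already set, then set
    __exit_to_user_mode():   warn if still set (unbalanced entry)
    kentry_syscall_end():    warn if not set, then clear

Kernel threads start with the flag set (see init_task below) so that
kernel_execve() can "exit" a syscall it never formally entered.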

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched.h |  4 ++++
 init/init_task.c      |  8 ++++++++
 kernel/entry/common.c | 24 +++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..95d6d8686d98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1368,6 +1368,10 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+#ifdef CONFIG_DEBUG_ENTRY
+	bool kentry_in_syscall;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 8a992d73e6fb..de4fdaac4e8b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -212,6 +212,14 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEBUG_ENTRY
+	/*
+	 * The init task, and kernel threads in general, are considered
+	 * to be "in a syscall".  This way they can execve() and then exit
+	 * the supposed syscall that they were in to go to user mode.
+	 */
+	.kentry_in_syscall = true,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f62934d761e3..9dea411de3b0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -148,11 +148,21 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 
 long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
-	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+	unsigned long work;
+
+	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+		DEBUG_ENTRY_WARN_ONCE(
+			current->kentry_in_syscall,
+			"entering syscall %ld while already in a syscall",
+			syscall);
+		current->kentry_in_syscall = true;
+	}
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	lockdep_assert_irqs_enabled();
 
+	work = READ_ONCE(current_thread_info()->syscall_work);
+
 	if (work & SYSCALL_WORK_ENTER)
 		syscall = syscall_trace_enter(regs, syscall, work);
 
@@ -163,11 +173,16 @@ long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
+
 #ifdef CONFIG_DEBUG_ENTRY
 	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
 			      "__exit_to_user_mode called at wrong kentry cpu depth (%u)",
 			      this_cpu_read(kentry_cpu_depth));
+
+	DEBUG_ENTRY_WARN_ONCE(current->kentry_in_syscall,
+			      "exiting to user mode while in syscall context");
 #endif
+
 	trace_hardirqs_on_prepare();
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	instrumentation_end();
@@ -331,6 +346,13 @@ void kentry_syscall_end(struct pt_regs *regs)
 	 */
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
+
+#ifdef CONFIG_DEBUG_ENTRY
+	DEBUG_ENTRY_WARN_ONCE(!current->kentry_in_syscall,
+			      "exiting syscall %lu without entering first", nr);
+
+	current->kentry_in_syscall = false;
+#endif
 }
 
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 11/11] kentry: Verify kentry state in instrumentation_begin/end()
  2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
                   ` (9 preceding siblings ...)
  2021-03-04 19:06 ` [PATCH v3 10/11] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
@ 2021-03-04 19:06 ` Andy Lutomirski
  10 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-04 19:06 UTC (permalink / raw)
  To: x86; +Cc: LKML, Mark Rutland, Andy Lutomirski

Calling instrumentation_begin() and instrumentation_end() when kentry
thinks the CPU is in user mode is an error.  Verify the kentry state when
instrumentation_begin/end() are called.

Add _nocheck() variants that skip the verification, so that the WARN()
machinery itself cannot recursively generate extra kentry warnings.
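
The recursion being avoided looks roughly like this (hypothetical call
chain, for illustration):

    instrumentation_begin()
      -> kentry_assert_may_instrument()     /* depth looks wrong, so... */
         -> WARN()
            -> instrumentation_begin()      /* would assert again */
               -> kentry_assert_may_instrument() -> WARN() -> ...

Hence __WARN_printf() and handle_bug() use the _nocheck() variants.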

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c         |  4 ++--
 include/asm-generic/bug.h       |  8 ++++----
 include/linux/instrumentation.h | 25 ++++++++++++++++++++-----
 kernel/entry/common.c           |  7 +++++++
 4 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index be924180005a..983e4be5fdcb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -229,7 +229,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 	/*
 	 * All lies, just get the WARN/BUG out.
 	 */
-	instrumentation_begin();
+	instrumentation_begin_nocheck();
 	/*
 	 * Since we're emulating a CALL with exceptions, restore the interrupt
 	 * state to what it was at the exception site.
@@ -242,7 +242,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 	}
 	if (regs->flags & X86_EFLAGS_IF)
 		raw_local_irq_disable();
-	instrumentation_end();
+	instrumentation_end_nocheck();
 
 	return handled;
 }
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 76a10e0dca9f..fc360c463a99 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -85,18 +85,18 @@ void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
 		       const char *fmt, ...);
 #define __WARN()		__WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {				\
-		instrumentation_begin();				\
+		instrumentation_begin_nocheck();			\
 		warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);	\
-		instrumentation_end();					\
+		instrumentation_end_nocheck();				\
 	} while (0)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()		__WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {				\
-		instrumentation_begin();				\
+		instrumentation_begin_nocheck();			\
 		__warn_printk(arg);					\
 		__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
-		instrumentation_end();					\
+		instrumentation_end_nocheck();				\
 	} while (0)
 #define WARN_ON_ONCE(condition) ({				\
 	int __ret_warn_on = !!(condition);			\
diff --git a/include/linux/instrumentation.h b/include/linux/instrumentation.h
index 93e2ad67fc10..cdf80454f92a 100644
--- a/include/linux/instrumentation.h
+++ b/include/linux/instrumentation.h
@@ -4,14 +4,21 @@
 
 #if defined(CONFIG_DEBUG_ENTRY) && defined(CONFIG_STACK_VALIDATION)
 
+extern void kentry_assert_may_instrument(void);
+
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({					\
-	asm volatile("%c0: nop\n\t"						\
+#define instrumentation_begin_nocheck() ({				\
+	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_begin\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
 
+#define instrumentation_begin() ({					\
+	instrumentation_begin_nocheck();				\
+	kentry_assert_may_instrument();					\
+})
+
 /*
  * Because instrumentation_{begin,end}() can nest, objtool validation considers
  * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
@@ -43,15 +50,23 @@
  * To avoid this, have _end() be a NOP instruction, this ensures it will be
  * part of the condition block and does not escape.
  */
-#define instrumentation_end() ({					\
+#define instrumentation_end_nocheck() ({				\
 	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_end\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
+
+#define instrumentation_end() ({					\
+	kentry_assert_may_instrument();					\
+	instrumentation_end_nocheck();					\
+})
+
 #else
-# define instrumentation_begin()	do { } while(0)
-# define instrumentation_end()		do { } while(0)
+# define instrumentation_begin_nocheck()	do { } while(0)
+# define instrumentation_begin()		do { } while(0)
+# define instrumentation_end_nocheck()		do { } while(0)
+# define instrumentation_end()			do { } while(0)
 #endif
 
 #endif /* __LINUX_INSTRUMENTATION_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9dea411de3b0..870595bdd405 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -56,6 +56,13 @@ static __always_inline void kentry_cpu_depth_sub(unsigned int n)
 {
 	this_cpu_sub(kentry_cpu_depth, n);
 }
+
+void kentry_assert_may_instrument(void)
+{
+	DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) == 0, "instrumentable code is running in the wrong kentry state");
+}
+EXPORT_SYMBOL_GPL(kentry_assert_may_instrument);
+
 #else
 
 #define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  2021-03-04 19:05 ` [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
@ 2021-03-04 20:19   ` Ira Weiny
  0 siblings, 0 replies; 23+ messages in thread
From: Ira Weiny @ 2021-03-04 20:19 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML, Mark Rutland, Josh Poimboeuf

On Thu, Mar 04, 2021 at 11:05:57AM -0800, Andy Lutomirski wrote:
> For kernel threads, task_pt_regs is currently all zeros, a valid user state
> (if kernel_execve() has been called), or some combination thereof during
> execution of kernel_execve().  If a stack trace is printed, the unwinder
> might get confused and treat task_pt_regs as a kernel state.  Indeed,
> forcing a stack dump results in an attempt to display _kernel_ code bytes
> from a bogus address at the very top of kernel memory.
> 
> Fix the confusion by setting cs=3 so that user_mode(task_pt_regs(...)) ==
> true for kernel threads.
> 
> Also improve the error when failing to fetch code bytes to show which type
> of fetch failed.  This makes it much easier to understand what's happening.
> 
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/dumpstack.c |  4 ++--
>  arch/x86/kernel/process.c   | 13 +++++++++++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 55cf3c8325c6..9b7b69bb12e5 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -128,8 +128,8 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
>  		/* No access to the user space stack of other tasks. Ignore. */
>  		break;
>  	default:
> -		printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
> -		       loglvl, prologue);
> +		printk("%sCode: Unable to access %s opcode bytes at RIP 0x%lx.\n",
> +		       loglvl, user_mode(regs) ? "user" : "kernel", prologue);
>  		break;
>  	}
>  }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 145a7ac0c19a..f6f16df04cb9 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & PF_KTHREAD)) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
> +
> +		/*
> +		 * Even though there is no real user state here, these
> +		 * are were user regs belong, and kernel_execve() will
                       ^^^^^
                       where?

Ira

> +		 * overwrite them with user regs.  Put an obviously
> +		 * invalid value that nonetheless causes user_mode(regs)
> +		 * to return true in CS.
> +		 *
> +		 * This also prevents the unwinder from thinking that there
> +		 * is invalid kernel state at the top of the stack.
> +		 */
> +		childregs->cs = 3;
> +
>  		kthread_frame_init(frame, sp, arg);
>  		return 0;
>  	}
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C
  2021-03-04 19:05 ` [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C Andy Lutomirski
@ 2021-03-05  0:55   ` Brian Gerst
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Gerst @ 2021-03-05  0:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, LKML, Mark Rutland, Josh Poimboeuf

On Thu, Mar 4, 2021 at 2:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> ret_from_fork is written in asm, slightly differently, for x86_32 and
> x86_64.  Convert it to C.
>
> This is a straight conversion without any particular cleverness.  As a
> further cleanup, the code that sets up the ret_from_fork argument registers
> could be adjusted to put the arguments in the correct registers.

An alternative would be to stash the function pointer and argument in
the pt_regs of the new kthread task, and test for PF_KTHREAD instead.
That would eliminate the need to pass those two values to the C
version of ret_from_fork().
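
Untested sketch of what I mean -- the pt_regs slot choice is purely
illustrative:

    __visible void noinstr ret_from_fork(struct task_struct *prev,
                                         struct pt_regs *regs)
    {
            instrumentation_begin();
            schedule_tail(prev);

            if (unlikely(current->flags & PF_KTHREAD)) {
                    /* hypothetically stashed by copy_thread() */
                    int (*fn)(void *) = (int (*)(void *))regs->bx;
                    void *arg = (void *)regs->di;

                    fn(arg);
                    /* fn() returned: kernel_execve() has set up user regs */
            }

            /* ... continue down the normal exit-to-user path ... */
            instrumentation_end();
    }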

> This will cause the ORC unwinder to find pt_regs even for kernel threads on
> x86_64.  This seems harmless.
>
> The 32-bit comment above the now-deleted schedule_tail_wrapper was
> obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
> same problem more cleanly.
>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/common.c          | 23 ++++++++++++++
>  arch/x86/entry/entry_32.S        | 51 +++++---------------------------
>  arch/x86/entry/entry_64.S        | 33 +++++----------------
>  arch/x86/include/asm/switch_to.h |  2 +-
>  arch/x86/kernel/process.c        |  2 +-
>  arch/x86/kernel/process_32.c     |  2 +-
>  arch/x86/kernel/unwind_orc.c     |  2 +-
>  7 files changed, 43 insertions(+), 72 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 95776f16c1cb..ef1c65938a6b 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
>         return -ENOSYS;
>  }
>
> +void ret_from_fork(struct task_struct *prev,
> +                  int (*kernel_thread_fn)(void *),
> +                  void *kernel_thread_arg,
> +                  struct pt_regs *user_regs);
> +
> +__visible void noinstr ret_from_fork(struct task_struct *prev,
> +                                    int (*kernel_thread_fn)(void *),
> +                                    void *kernel_thread_arg,
> +                                    struct pt_regs *user_regs)
> +{
> +       instrumentation_begin();
> +
> +       schedule_tail(prev);
> +
> +       if (kernel_thread_fn) {

This should have an unlikely(), as kernel threads should be the rare case.
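
i.e. something like (sketch):

    if (unlikely(kernel_thread_fn)) {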

--
Brian Gerst

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
@ 2021-03-05 10:16   ` tip-bot2 for Andy Lutomirski
  2021-03-06 10:44   ` tip-bot2 for Andy Lutomirski
  2021-03-06 12:18   ` tip-bot2 for Andy Lutomirski
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-03-05 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andy Lutomirski, Thomas Gleixner, stable, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     dabf017539988a9bfc40a38dbafd35c501bacc44
Gitweb:        https://git.kernel.org/tip/dabf017539988a9bfc40a38dbafd35c501bacc44
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Thu, 04 Mar 2021 11:05:54 -08:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 05 Mar 2021 11:10:13 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

    $ ./tools/testing/selftests/x86/syscall_arg_fault_32
    ...
    strace: pid 264258: entering, ptrace_syscall_info.op == 2
    ...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.luto@kernel.org

---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 		regs->ax = -EFAULT;
 
 		instrumentation_end();
-		syscall_exit_to_user_mode(regs);
+		local_irq_disable();
+		irqentry_exit_to_user_mode(regs);
 		return false;
 	}
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
  2021-03-05 10:16   ` [tip: x86/urgent] " tip-bot2 for Andy Lutomirski
@ 2021-03-06 10:44   ` tip-bot2 for Andy Lutomirski
  2021-03-06 12:18   ` tip-bot2 for Andy Lutomirski
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-03-06 10:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andy Lutomirski, Thomas Gleixner, stable, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     e59ba7bf71a09e474198741563e0e587ae43d1c7
Gitweb:        https://git.kernel.org/tip/e59ba7bf71a09e474198741563e0e587ae43d1c7
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Thu, 04 Mar 2021 11:05:54 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Sat, 06 Mar 2021 11:37:00 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

    $ ./tools/testing/selftests/x86/syscall_arg_fault_32
    ...
    strace: pid 264258: entering, ptrace_syscall_info.op == 2
    ...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.luto@kernel.org

---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 		regs->ax = -EFAULT;
 
 		instrumentation_end();
-		syscall_exit_to_user_mode(regs);
+		local_irq_disable();
+		irqentry_exit_to_user_mode(regs);
 		return false;
 	}
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
  2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
  2021-03-05 10:16   ` [tip: x86/urgent] " tip-bot2 for Andy Lutomirski
  2021-03-06 10:44   ` tip-bot2 for Andy Lutomirski
@ 2021-03-06 12:18   ` tip-bot2 for Andy Lutomirski
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-03-06 12:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andy Lutomirski, Thomas Gleixner, Borislav Petkov, stable, x86,
	linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     5d5675df792ff67e74a500c4c94db0f99e6a10ef
Gitweb:        https://git.kernel.org/tip/5d5675df792ff67e74a500c4c94db0f99e6a10ef
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Thu, 04 Mar 2021 11:05:54 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Sat, 06 Mar 2021 13:10:06 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

    $ ./tools/testing/selftests/x86/syscall_arg_fault_32
    ...
    strace: pid 264258: entering, ptrace_syscall_info.op == 2
    ...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.luto@kernel.org
---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 		regs->ax = -EFAULT;
 
 		instrumentation_end();
-		syscall_exit_to_user_mode(regs);
+		local_irq_disable();
+		irqentry_exit_to_user_mode(regs);
 		return false;
 	}
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 02/11] kentry: Rename irqentry to kentry
  2021-03-04 19:05 ` [PATCH v3 02/11] kentry: Rename irqentry to kentry Andy Lutomirski
@ 2021-03-08  9:45   ` Mark Rutland
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Rutland @ 2021-03-08  9:45 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML

On Thu, Mar 04, 2021 at 11:05:55AM -0800, Andy Lutomirski wrote:
> The common entry functions are mostly named irqentry, and this is
> confusing.  They are used for syscalls, exceptions, NMIs and, yes, IRQs.
> Call them kentry instead, since they are supposed to be usable for any
> entry to the kernel.
> 
> This patch doesn't touch the .irqentry section -- someone can contemplate
> changing that later.  That code predates the common entry code.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

FWIW, I agree this is a better name, so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  arch/x86/entry/common.c         |  8 ++---
>  arch/x86/include/asm/idtentry.h | 28 +++++++--------
>  arch/x86/kernel/cpu/mce/core.c  | 10 +++---
>  arch/x86/kernel/kvm.c           |  6 ++--
>  arch/x86/kernel/nmi.c           |  6 ++--
>  arch/x86/kernel/traps.c         | 28 +++++++--------
>  arch/x86/mm/fault.c             |  6 ++--
>  include/linux/entry-common.h    | 60 ++++++++++++++++-----------------
>  kernel/entry/common.c           | 26 +++++++-------
>  9 files changed, 89 insertions(+), 89 deletions(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 8fdb4cb27efe..95776f16c1cb 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -264,9 +264,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  {
>  	struct pt_regs *old_regs;
>  	bool inhcall;
> -	irqentry_state_t state;
> +	kentry_state_t state;
>  
> -	state = irqentry_enter(regs);
> +	state = kentry_enter(regs);
>  	old_regs = set_irq_regs(regs);
>  
>  	instrumentation_begin();
> @@ -278,11 +278,11 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
>  		instrumentation_begin();
> -		irqentry_exit_cond_resched();
> +		kentry_exit_cond_resched();
>  		instrumentation_end();
>  		restore_inhcall(inhcall);
>  	} else {
> -		irqentry_exit(regs, state);
> +		kentry_exit(regs, state);
>  	}
>  }
>  #endif /* CONFIG_XEN_PV */
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index f656aabd1545..3ac805d24289 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -40,8 +40,8 @@
>   * The macro is written so it acts as function definition. Append the
>   * body with a pair of curly brackets.
>   *
> - * irqentry_enter() contains common code which has to be invoked before
> - * arbitrary code in the body. irqentry_exit() contains common code
> + * kentry_enter() contains common code which has to be invoked before
> + * arbitrary code in the body. kentry_exit() contains common code
>   * which has to run before returning to the low level assembly code.
>   */
>  #define DEFINE_IDTENTRY(func)						\
> @@ -49,12 +49,12 @@ static __always_inline void __##func(struct pt_regs *regs);		\
>  									\
>  __visible noinstr void func(struct pt_regs *regs)			\
>  {									\
> -	irqentry_state_t state = irqentry_enter(regs);			\
> +	kentry_state_t state = kentry_enter(regs);			\
>  									\
>  	instrumentation_begin();					\
>  	__##func (regs);						\
>  	instrumentation_end();						\
> -	irqentry_exit(regs, state);					\
> +	kentry_exit(regs, state);					\
>  }									\
>  									\
>  static __always_inline void __##func(struct pt_regs *regs)
> @@ -96,12 +96,12 @@ static __always_inline void __##func(struct pt_regs *regs,		\
>  __visible noinstr void func(struct pt_regs *regs,			\
>  			    unsigned long error_code)			\
>  {									\
> -	irqentry_state_t state = irqentry_enter(regs);			\
> +	kentry_state_t state = kentry_enter(regs);			\
>  									\
>  	instrumentation_begin();					\
>  	__##func (regs, error_code);					\
>  	instrumentation_end();						\
> -	irqentry_exit(regs, state);					\
> +	kentry_exit(regs, state);					\
>  }									\
>  									\
>  static __always_inline void __##func(struct pt_regs *regs,		\
> @@ -156,7 +156,7 @@ __visible noinstr void func(struct pt_regs *regs)
>   * body with a pair of curly brackets.
>   *
>   * Contrary to DEFINE_IDTENTRY_ERRORCODE() this does not invoke the
> - * irqentry_enter/exit() helpers before and after the body invocation. This
> + * kentry_enter/exit() helpers before and after the body invocation. This
>   * needs to be done in the body itself if applicable. Use if extra work
>   * is required before the enter/exit() helpers are invoked.
>   */
> @@ -192,7 +192,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
>  __visible noinstr void func(struct pt_regs *regs,			\
>  			    unsigned long error_code)			\
>  {									\
> -	irqentry_state_t state = irqentry_enter(regs);			\
> +	kentry_state_t state = kentry_enter(regs);			\
>  									\
>  	instrumentation_begin();					\
>  	irq_enter_rcu();						\
> @@ -200,7 +200,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
>  	__##func (regs, (u8)error_code);				\
>  	irq_exit_rcu();							\
>  	instrumentation_end();						\
> -	irqentry_exit(regs, state);					\
> +	kentry_exit(regs, state);					\
>  }									\
>  									\
>  static __always_inline void __##func(struct pt_regs *regs, u8 vector)
> @@ -224,7 +224,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector)
>   * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
>   * @func:	Function name of the entry point
>   *
> - * irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
> + * kentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
>   * function body. KVM L1D flush request is set.
>   *
>   * Runs the function on the interrupt stack if the entry hit kernel mode
> @@ -234,7 +234,7 @@ static void __##func(struct pt_regs *regs);				\
>  									\
>  __visible noinstr void func(struct pt_regs *regs)			\
>  {									\
> -	irqentry_state_t state = irqentry_enter(regs);			\
> +	kentry_state_t state = kentry_enter(regs);			\
>  									\
>  	instrumentation_begin();					\
>  	irq_enter_rcu();						\
> @@ -242,7 +242,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
>  	run_sysvec_on_irqstack_cond(__##func, regs);			\
>  	irq_exit_rcu();							\
>  	instrumentation_end();						\
> -	irqentry_exit(regs, state);					\
> +	kentry_exit(regs, state);					\
>  }									\
>  									\
>  static noinline void __##func(struct pt_regs *regs)
> @@ -263,7 +263,7 @@ static __always_inline void __##func(struct pt_regs *regs);		\
>  									\
>  __visible noinstr void func(struct pt_regs *regs)			\
>  {									\
> -	irqentry_state_t state = irqentry_enter(regs);			\
> +	kentry_state_t state = kentry_enter(regs);			\
>  									\
>  	instrumentation_begin();					\
>  	__irq_enter_raw();						\
> @@ -271,7 +271,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
>  	__##func (regs);						\
>  	__irq_exit_raw();						\
>  	instrumentation_end();						\
> -	irqentry_exit(regs, state);					\
> +	kentry_exit(regs, state);					\
>  }									\
>  									\
>  static __always_inline void __##func(struct pt_regs *regs)
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 47a0eb57725c..3d4bfd54934e 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1974,7 +1974,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;
>  
>  static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
>  {
> -	irqentry_state_t irq_state;
> +	kentry_state_t irq_state;
>  
>  	WARN_ON_ONCE(user_mode(regs));
>  
> @@ -1986,7 +1986,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
>  	    mce_check_crashing_cpu())
>  		return;
>  
> -	irq_state = irqentry_nmi_enter(regs);
> +	irq_state = kentry_nmi_enter(regs);
>  	/*
>  	 * The call targets are marked noinstr, but objtool can't figure
>  	 * that out because it's an indirect call. Annotate it.
> @@ -1996,18 +1996,18 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
>  	machine_check_vector(regs);
>  
>  	instrumentation_end();
> -	irqentry_nmi_exit(regs, irq_state);
> +	kentry_nmi_exit(regs, irq_state);
>  }
>  
>  static __always_inline void exc_machine_check_user(struct pt_regs *regs)
>  {
> -	irqentry_enter_from_user_mode(regs);
> +	kentry_enter_from_user_mode(regs);
>  	instrumentation_begin();
>  
>  	machine_check_vector(regs);
>  
>  	instrumentation_end();
> -	irqentry_exit_to_user_mode(regs);
> +	kentry_exit_to_user_mode(regs);
>  }
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5e78e01ca3b4..bcecc3e7cd0a 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
>  noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  {
>  	u32 flags = kvm_read_and_reset_apf_flags();
> -	irqentry_state_t state;
> +	kentry_state_t state;
>  
>  	if (!flags)
>  		return false;
>  
> -	state = irqentry_enter(regs);
> +	state = kentry_enter(regs);
>  	instrumentation_begin();
>  
>  	/*
> @@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  	}
>  
>  	instrumentation_end();
> -	irqentry_exit(regs, state);
> +	kentry_exit(regs, state);
>  	return true;
>  }
>  
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index bf250a339655..6f46722a6e94 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
>  
>  DEFINE_IDTENTRY_RAW(exc_nmi)
>  {
> -	irqentry_state_t irq_state;
> +	kentry_state_t irq_state;
>  
>  	/*
>  	 * Re-enable NMIs right here when running as an SEV-ES guest. This might
> @@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>  
>  	this_cpu_write(nmi_dr7, local_db_save());
>  
> -	irq_state = irqentry_nmi_enter(regs);
> +	irq_state = kentry_nmi_enter(regs);
>  
>  	inc_irq_stat(__nmi_count);
>  
>  	if (!ignore_nmis)
>  		default_do_nmi(regs);
>  
> -	irqentry_nmi_exit(regs, irq_state);
> +	kentry_nmi_exit(regs, irq_state);
>  
>  	local_db_restore(this_cpu_read(nmi_dr7));
>  
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7f5aec758f0e..be924180005a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -249,7 +249,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
>  
>  DEFINE_IDTENTRY_RAW(exc_invalid_op)
>  {
> -	irqentry_state_t state;
> +	kentry_state_t state;
>  
>  	/*
>  	 * We use UD2 as a short encoding for 'CALL __WARN', as such
> @@ -259,11 +259,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
>  	if (!user_mode(regs) && handle_bug(regs))
>  		return;
>  
> -	state = irqentry_enter(regs);
> +	state = kentry_enter(regs);
>  	instrumentation_begin();
>  	handle_invalid_op(regs);
>  	instrumentation_end();
> -	irqentry_exit(regs, state);
> +	kentry_exit(regs, state);
>  }
>  
>  DEFINE_IDTENTRY(exc_coproc_segment_overrun)
> @@ -410,7 +410,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
>  	}
>  #endif
>  
> -	irqentry_nmi_enter(regs);
> +	kentry_nmi_enter(regs);
>  	instrumentation_begin();
>  	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
>  
> @@ -646,26 +646,26 @@ DEFINE_IDTENTRY_RAW(exc_int3)
>  		return;
>  
>  	/*
> -	 * irqentry_enter_from_user_mode() uses static_branch_{,un}likely()
> +	 * kentry_enter_from_user_mode() uses static_branch_{,un}likely()
>  	 * and therefore can trigger INT3, hence poke_int3_handler() must
>  	 * be done before. If the entry came from kernel mode, then use
>  	 * nmi_enter() because the INT3 could have been hit in any context
>  	 * including NMI.
>  	 */
>  	if (user_mode(regs)) {
> -		irqentry_enter_from_user_mode(regs);
> +		kentry_enter_from_user_mode(regs);
>  		instrumentation_begin();
>  		do_int3_user(regs);
>  		instrumentation_end();
> -		irqentry_exit_to_user_mode(regs);
> +		kentry_exit_to_user_mode(regs);
>  	} else {
> -		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
> +		kentry_state_t irq_state = kentry_nmi_enter(regs);
>  
>  		instrumentation_begin();
>  		if (!do_int3(regs))
>  			die("int3", regs, 0);
>  		instrumentation_end();
> -		irqentry_nmi_exit(regs, irq_state);
> +		kentry_nmi_exit(regs, irq_state);
>  	}
>  }
>  
> @@ -860,7 +860,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
>  	 * includes the entry stack is excluded for everything.
>  	 */
>  	unsigned long dr7 = local_db_save();
> -	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
> +	kentry_state_t irq_state = kentry_nmi_enter(regs);
>  	instrumentation_begin();
>  
>  	/*
> @@ -917,7 +917,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
>  		regs->flags &= ~X86_EFLAGS_TF;
>  out:
>  	instrumentation_end();
> -	irqentry_nmi_exit(regs, irq_state);
> +	kentry_nmi_exit(regs, irq_state);
>  
>  	local_db_restore(dr7);
>  }
> @@ -935,14 +935,14 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
>  
>  	/*
>  	 * NB: We can't easily clear DR7 here because
> -	 * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
> +	 * kentry_exit_to_usermode() can invoke ptrace, schedule, access
>  	 * user memory, etc.  This means that a recursive #DB is possible.  If
>  	 * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
>  	 * Since we're not on the IST stack right now, everything will be
>  	 * fine.
>  	 */
>  
> -	irqentry_enter_from_user_mode(regs);
> +	kentry_enter_from_user_mode(regs);
>  	instrumentation_begin();
>  
>  	/*
> @@ -988,7 +988,7 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
>  	local_irq_disable();
>  out:
>  	instrumentation_end();
> -	irqentry_exit_to_user_mode(regs);
> +	kentry_exit_to_user_mode(regs);
>  }
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 106b22d1d189..fea6dee3f6f2 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1432,7 +1432,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
>  DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
>  {
>  	unsigned long address = read_cr2();
> -	irqentry_state_t state;
> +	kentry_state_t state;
>  
>  	prefetchw(&current->mm->mmap_lock);
>  
> @@ -1470,11 +1470,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
>  	 * code reenabled RCU to avoid subsequent wreckage which helps
>  	 * debugability.
>  	 */
> -	state = irqentry_enter(regs);
> +	state = kentry_enter(regs);
>  
>  	instrumentation_begin();
>  	handle_page_fault(regs, error_code, address);
>  	instrumentation_end();
>  
> -	irqentry_exit(regs, state);
> +	kentry_exit(regs, state);
>  }
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index ca86a00abe86..fd2d7c35670a 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -363,7 +363,7 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
>  void syscall_exit_to_user_mode(struct pt_regs *regs);
>  
>  /**
> - * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
> + * kentry_enter_from_user_mode - Establish state before invoking the irq handler
>   * @regs:	Pointer to currents pt_regs
>   *
>   * Invoked from architecture specific entry code with interrupts disabled.
> @@ -373,10 +373,10 @@ void syscall_exit_to_user_mode(struct pt_regs *regs);
>   *
>   * The function establishes state (lockdep, RCU (context tracking), tracing)
>   */
> -void irqentry_enter_from_user_mode(struct pt_regs *regs);
> +void kentry_enter_from_user_mode(struct pt_regs *regs);
>  
>  /**
> - * irqentry_exit_to_user_mode - Interrupt exit work
> + * kentry_exit_to_user_mode - Interrupt exit work
>   * @regs:	Pointer to current's pt_regs
>   *
>   * Invoked with interrupts disbled and fully valid regs. Returns with all
> @@ -388,34 +388,34 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
>   * Interrupt exit is not invoking #1 which is the syscall specific one time
>   * work.
>   */
> -void irqentry_exit_to_user_mode(struct pt_regs *regs);
> +void kentry_exit_to_user_mode(struct pt_regs *regs);
>  
> -#ifndef irqentry_state
> +#ifndef kentry_state
>  /**
> - * struct irqentry_state - Opaque object for exception state storage
> - * @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the
> + * struct kentry_state - Opaque object for exception state storage
> + * @exit_rcu: Used exclusively in the kentry_*() calls; signals whether the
>   *            exit path has to invoke rcu_irq_exit().
> - * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that
> + * @lockdep: Used exclusively in the kentry_nmi_*() calls; ensures that
>   *           lockdep state is restored correctly on exit from nmi.
>   *
> - * This opaque object is filled in by the irqentry_*_enter() functions and
> - * must be passed back into the corresponding irqentry_*_exit() functions
> + * This opaque object is filled in by the kentry_*_enter() functions and
> + * must be passed back into the corresponding kentry_*_exit() functions
>   * when the exception is complete.
>   *
> - * Callers of irqentry_*_[enter|exit]() must consider this structure opaque
> + * Callers of kentry_*_[enter|exit]() must consider this structure opaque
>   * and all members private.  Descriptions of the members are provided to aid in
> - * the maintenance of the irqentry_*() functions.
> + * the maintenance of the kentry_*() functions.
>   */
> -typedef struct irqentry_state {
> +typedef struct kentry_state {
>  	union {
>  		bool	exit_rcu;
>  		bool	lockdep;
>  	};
> -} irqentry_state_t;
> +} kentry_state_t;
>  #endif
>  
>  /**
> - * irqentry_enter - Handle state tracking on ordinary interrupt entries
> + * kentry_enter - Handle state tracking on ordinary interrupt entries
>   * @regs:	Pointer to pt_regs of interrupted context
>   *
>   * Invokes:
> @@ -439,25 +439,25 @@ typedef struct irqentry_state {
>   * solves the problem of kernel mode pagefaults which can schedule, which
>   * is not possible after invoking rcu_irq_enter() without undoing it.
>   *
> - * For user mode entries irqentry_enter_from_user_mode() is invoked to
> + * For user mode entries kentry_enter_from_user_mode() is invoked to
>   * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
>   * would not be possible.
>   *
>   * Returns: An opaque object that must be passed to idtentry_exit()
>   */
> -irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
> +kentry_state_t noinstr kentry_enter(struct pt_regs *regs);
>  
>  /**
> - * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
> + * kentry_exit_cond_resched - Conditionally reschedule on return from interrupt
>   *
>   * Conditional reschedule with additional sanity checks.
>   */
> -void irqentry_exit_cond_resched(void);
> +void kentry_exit_cond_resched(void);
>  
>  /**
> - * irqentry_exit - Handle return from exception that used irqentry_enter()
> + * kentry_exit - Handle return from exception that used kentry_enter()
>   * @regs:	Pointer to pt_regs (exception entry regs)
> - * @state:	Return value from matching call to irqentry_enter()
> + * @state:	Return value from matching call to kentry_enter()
>   *
>   * Depending on the return target (kernel/user) this runs the necessary
>   * preemption and work checks if possible and required and returns to
> @@ -466,27 +466,27 @@ void irqentry_exit_cond_resched(void);
>   * This is the last action before returning to the low level ASM code which
>   * just needs to return to the appropriate context.
>   *
> - * Counterpart to irqentry_enter().
> + * Counterpart to kentry_enter().
>   */
> -void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
> +void noinstr kentry_exit(struct pt_regs *regs, kentry_state_t state);
>  
>  /**
> - * irqentry_nmi_enter - Handle NMI entry
> + * kentry_nmi_enter - Handle NMI entry
>   * @regs:	Pointer to currents pt_regs
>   *
> - * Similar to irqentry_enter() but taking care of the NMI constraints.
> + * Similar to kentry_enter() but taking care of the NMI constraints.
>   */
> -irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
> +kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs);
>  
>  /**
> - * irqentry_nmi_exit - Handle return from NMI handling
> + * kentry_nmi_exit - Handle return from NMI handling
>   * @regs:	Pointer to pt_regs (NMI entry regs)
> - * @irq_state:	Return value from matching call to irqentry_nmi_enter()
> + * @irq_state:	Return value from matching call to kentry_nmi_enter()
>   *
>   * Last action before returning to the low level assembly code.
>   *
> - * Counterpart to irqentry_nmi_enter().
> + * Counterpart to kentry_nmi_enter().
>   */
> -void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
> +void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state);
>  
>  #endif
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 378341642f94..269766a8f981 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -304,12 +304,12 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
>  	__exit_to_user_mode();
>  }
>  
> -noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
> +noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
>  {
>  	__enter_from_user_mode(regs);
>  }
>  
> -noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
> +noinstr void kentry_exit_to_user_mode(struct pt_regs *regs)
>  {
>  	instrumentation_begin();
>  	exit_to_user_mode_prepare(regs);
> @@ -317,14 +317,14 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
>  	__exit_to_user_mode();
>  }
>  
> -noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> +noinstr kentry_state_t kentry_enter(struct pt_regs *regs)
>  {
> -	irqentry_state_t ret = {
> +	kentry_state_t ret = {
>  		.exit_rcu = false,
>  	};
>  
>  	if (user_mode(regs)) {
> -		irqentry_enter_from_user_mode(regs);
> +		kentry_enter_from_user_mode(regs);
>  		return ret;
>  	}
>  
> @@ -355,7 +355,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		/*
>  		 * If RCU is not watching then the same careful
>  		 * sequence vs. lockdep and tracing is required
> -		 * as in irqentry_enter_from_user_mode().
> +		 * as in kentry_enter_from_user_mode().
>  		 */
>  		lockdep_hardirqs_off(CALLER_ADDR0);
>  		rcu_irq_enter();
> @@ -382,7 +382,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  	return ret;
>  }
>  
> -void irqentry_exit_cond_resched(void)
> +void kentry_exit_cond_resched(void)
>  {
>  	if (!preempt_count()) {
>  		/* Sanity check RCU and thread stack */
> @@ -394,13 +394,13 @@ void irqentry_exit_cond_resched(void)
>  	}
>  }
>  
> -noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> +noinstr void kentry_exit(struct pt_regs *regs, kentry_state_t state)
>  {
>  	lockdep_assert_irqs_disabled();
>  
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
> -		irqentry_exit_to_user_mode(regs);
> +		kentry_exit_to_user_mode(regs);
>  	} else if (!regs_irqs_disabled(regs)) {
>  		/*
>  		 * If RCU was not watching on entry this needs to be done
> @@ -420,7 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  
>  		instrumentation_begin();
>  		if (IS_ENABLED(CONFIG_PREEMPTION))
> -			irqentry_exit_cond_resched();
> +			kentry_exit_cond_resched();
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
>  		instrumentation_end();
> @@ -434,9 +434,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	}
>  }
>  
> -irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> +kentry_state_t noinstr kentry_nmi_enter(struct pt_regs *regs)
>  {
> -	irqentry_state_t irq_state;
> +	kentry_state_t irq_state;
>  
>  	irq_state.lockdep = lockdep_hardirqs_enabled();
>  
> @@ -453,7 +453,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  	return irq_state;
>  }
>  
> -void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
> +void noinstr kentry_nmi_exit(struct pt_regs *regs, kentry_state_t irq_state)
>  {
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
> -- 
> 2.29.2
> 
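
The renamed helpers above are meant to be called in matched enter/exit
pairs.  A minimal sketch of that pairing, condensed from the kerneldoc
comments in this patch (demo_handle_irq() and demo_handle_nmi() are
illustrative placeholders, not functions from the series):

	static noinstr void demo_irq_entry(struct pt_regs *regs)
	{
		kentry_state_t state = kentry_enter(regs);

		instrumentation_begin();
		demo_handle_irq(regs);		/* placeholder handler */
		instrumentation_end();

		kentry_exit(regs, state);	/* counterpart to kentry_enter() */
	}

	static noinstr void demo_nmi_entry(struct pt_regs *regs)
	{
		kentry_state_t state = kentry_nmi_enter(regs);

		instrumentation_begin();
		demo_handle_nmi(regs);		/* placeholder handler */
		instrumentation_end();

		kentry_nmi_exit(regs, state);	/* counterpart to kentry_nmi_enter() */
	}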


* Re: [PATCH v3 06/11] kentry: Simplify the common syscall API
  2021-03-04 19:05 ` [PATCH v3 06/11] kentry: Simplify the common syscall API Andy Lutomirski
@ 2021-03-08  9:49   ` Mark Rutland
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Rutland @ 2021-03-08  9:49 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML

On Thu, Mar 04, 2021 at 11:05:59AM -0800, Andy Lutomirski wrote:
> The new common syscall API had a large and confusing API surface.  Simplify
> it.  Now there is exactly one way to use it.  It's a bit more verbose than
> the old way for the simple x86_64 native case, but it's much easier to use
> right, and the diffstat should speak for itself.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

I think that having this be more verbose will make it easier to
handle ABI warts that differ across architectures, so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.
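
For reference, the per-architecture calling sequence this patch settles on
looks roughly like the following sketch (condensed from the x86-64 hunk
quoted below; demo_dispatch_syscall() is a placeholder for the
arch-specific dispatch, not a function from the patch):

	__visible noinstr void demo_do_syscall(unsigned long nr, struct pt_regs *regs)
	{
		kentry_enter_from_user_mode(regs);	/* lockdep/RCU/tracing state */
		local_irq_enable();
		instrumentation_begin();

		nr = kentry_syscall_begin(regs, nr);	/* ptrace/seccomp/audit entry work */
		demo_dispatch_syscall(regs, nr);	/* placeholder: arch dispatch */
		kentry_syscall_end(regs);		/* syscall exit work, IRQs stay on */

		local_irq_disable();
		instrumentation_end();
		kentry_exit_to_user_mode(regs);		/* final transition to user mode */
	}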

> ---
>  arch/x86/entry/common.c      | 57 +++++++++++++++---------
>  include/linux/entry-common.h | 86 ++++++------------------------------
>  kernel/entry/common.c        | 57 +++---------------------
>  3 files changed, 54 insertions(+), 146 deletions(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index ef1c65938a6b..8710b2300b8d 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -38,9 +38,12 @@
>  #ifdef CONFIG_X86_64
>  __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
> -	nr = syscall_enter_from_user_mode(regs, nr);
> -
> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();
>  	instrumentation_begin();
> +
> +	nr = kentry_syscall_begin(regs, nr);
> +
>  	if (likely(nr < NR_syscalls)) {
>  		nr = array_index_nospec(nr, NR_syscalls);
>  		regs->ax = sys_call_table[nr](regs);
> @@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  		regs->ax = x32_sys_call_table[nr](regs);
>  #endif
>  	}
> +
> +	kentry_syscall_end(regs);
> +
> +	local_irq_disable();
>  	instrumentation_end();
> -	syscall_exit_to_user_mode(regs);
> +	kentry_exit_to_user_mode(regs);
>  }
>  #endif
>  
> @@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
>  {
>  	unsigned int nr = syscall_32_enter(regs);
>  
> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();
> +	instrumentation_begin();
> +
>  	/*
>  	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
>  	 * orig_ax, the unsigned int return value truncates it.  This may
>  	 * or may not be necessary, but it matches the old asm behavior.
>  	 */
> -	nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
> -	instrumentation_begin();
> -
> +	nr = (unsigned int)kentry_syscall_begin(regs, nr);
>  	do_syscall_32_irqs_on(regs, nr);
> +	kentry_syscall_end(regs);
>  
> +	local_irq_disable();
>  	instrumentation_end();
> -	syscall_exit_to_user_mode(regs);
> +	kentry_exit_to_user_mode(regs);
>  }
>  
>  static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>  {
>  	unsigned int nr = syscall_32_enter(regs);
> +	bool ret;
>  	int res;
>  
> -	/*
> -	 * This cannot use syscall_enter_from_user_mode() as it has to
> -	 * fetch EBP before invoking any of the syscall entry work
> -	 * functions.
> -	 */
> -	syscall_enter_from_user_mode_prepare(regs);
> -
> +	kentry_enter_from_user_mode(regs);
> +	local_irq_enable();
>  	instrumentation_begin();
> +
>  	/* Fetch EBP from where the vDSO stashed it. */
>  	if (IS_ENABLED(CONFIG_X86_64)) {
>  		/*
> @@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>  	if (res) {
>  		/* User code screwed up. */
>  		regs->ax = -EFAULT;
> -
> -		instrumentation_end();
> -		local_irq_disable();
> -		irqentry_exit_to_user_mode(regs);
> -		return false;
> +		ret = false;
> +		goto out;
>  	}
>  
>  	/* The cast truncates any ptrace-induced syscall nr > 2^32 - 1 */
> -	nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
> +	nr = (unsigned int)kentry_syscall_begin(regs, nr);
>  
>  	/* Now this is just like a normal syscall. */
>  	do_syscall_32_irqs_on(regs, nr);
>  
> +	kentry_syscall_end(regs);
> +	ret = true;
> +
> +out:
> +	local_irq_disable();
>  	instrumentation_end();
> -	syscall_exit_to_user_mode(regs);
> -	return true;
> +	kentry_exit_to_user_mode(regs);
> +	return ret;
>  }
>  
> @@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct *prev,
>  		user_regs->ax = 0;
>  	}
>  
> +	kentry_syscall_end(user_regs);
> +
> +	local_irq_disable();
>  	instrumentation_end();
> -	syscall_exit_to_user_mode(user_regs);
> +	kentry_exit_to_user_mode(user_regs);
>  }
>  
>  #ifdef CONFIG_XEN_PV
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index fd2d7c35670a..5287c6c15a66 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -119,31 +119,12 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
>  void enter_from_user_mode(struct pt_regs *regs);
>  
>  /**
> - * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
> - * @regs:	Pointer to currents pt_regs
> - *
> - * Invoked from architecture specific syscall entry code with interrupts
> - * disabled. The calling code has to be non-instrumentable. When the
> - * function returns all state is correct, interrupts are enabled and the
> - * subsequent functions can be instrumented.
> - *
> - * This handles lockdep, RCU (context tracking) and tracing state, i.e.
> - * the functionality provided by enter_from_user_mode().
> - *
> - * This is invoked when there is extra architecture specific functionality
> - * to be done between establishing state and handling user mode entry work.
> - */
> -void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
> -
> -/**
> - * syscall_enter_from_user_mode_work - Check and handle work before invoking
> - *				       a syscall
> + * kentry_syscall_begin - Prepare to invoke a syscall handler
>   * @regs:	Pointer to current's pt_regs
>   * @syscall:	The syscall number
>   *
>   * Invoked from architecture specific syscall entry code with interrupts
> - * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
> - * architecture specific work.
> + * enabled after kentry_enter_from_user_mode() or a similar function.
>   *
>   * Returns: The original or a modified syscall number
>   *
> @@ -152,32 +133,16 @@ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
>   * syscall_set_return_value() first.  If neither of those are called and -1
>   * is returned, then the syscall will fail with ENOSYS.
>   *
> + * After calling kentry_syscall_begin(), regardless of the return value,
> + * the caller must call kentry_syscall_end().
> + *
>   * It handles the following work items:
>   *
>   *  1) syscall_work flag dependent invocations of
>   *     arch_syscall_enter_tracehook(), __secure_computing(), trace_sys_enter()
>   *  2) Invocation of audit_syscall_entry()
>   */
> -long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
> -
> -/**
> - * syscall_enter_from_user_mode - Establish state and check and handle work
> - *				  before invoking a syscall
> - * @regs:	Pointer to currents pt_regs
> - * @syscall:	The syscall number
> - *
> - * Invoked from architecture specific syscall entry code with interrupts
> - * disabled. The calling code has to be non-instrumentable. When the
> - * function returns all state is correct, interrupts are enabled and the
> - * subsequent functions can be instrumented.
> - *
> - * This is combination of syscall_enter_from_user_mode_prepare() and
> - * syscall_enter_from_user_mode_work().
> - *
> - * Returns: The original or a modified syscall number. See
> - * syscall_enter_from_user_mode_work() for further explanation.
> - */
> -long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
> +long kentry_syscall_begin(struct pt_regs *regs, long syscall);
>  
>  /**
>   * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
> @@ -317,28 +282,16 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
>  void exit_to_user_mode(void);
>  
>  /**
> - * syscall_exit_to_user_mode_work - Handle work before returning to user mode
> + * kentry_syscall_end - Finish syscall processing
>   * @regs:	Pointer to current's pt_regs
>   *
> - * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling
> - * exit_to_user_mode() to perform the final transition to user mode.
>   *
> - * Calling convention is the same as for syscall_exit_to_user_mode() and it
> - * returns with all work handled and interrupts disabled. The caller must
> - * invoke exit_to_user_mode() before actually switching to user mode to
> - * make the final state transitions. Interrupts must stay disabled between
> - * return from this function and the invocation of exit_to_user_mode().
> - */
> -void syscall_exit_to_user_mode_work(struct pt_regs *regs);
> -
> -/**
> - * syscall_exit_to_user_mode - Handle work before returning to user mode
> - * @regs:	Pointer to currents pt_regs
> + * This must be called after arch code calls kentry_syscall_begin()
> + * and invokes the syscall handler, if any.  This must also be called when
> + * returning from fork() to user mode, since return-from-fork is considered
> + * to be a syscall return.
>   *
> - * Invoked with interrupts enabled and fully valid regs. Returns with all
> - * work handled, interrupts disabled such that the caller can immediately
> - * switch to user mode. Called from architecture specific syscall and ret
> - * from fork code.
> + * Called with IRQs on.  Returns with IRQs still on.
>   *
>   * The call order is:
>   *  1) One-time syscall exit work:
> @@ -346,21 +299,8 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
>   *      - audit
>   *	- syscall tracing
>   *	- tracehook (single stepping)
> - *
> - *  2) Preparatory work
> - *	- Exit to user mode loop (common TIF handling). Invokes
> - *	  arch_exit_to_user_mode_work() for architecture specific TIF work
> - *	- Architecture specific one time work arch_exit_to_user_mode_prepare()
> - *	- Address limit and lockdep checks
> - *
> - *  3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
> - *     functionality in exit_to_user_mode().
> - *
> - * This is a combination of syscall_exit_to_user_mode_work() (1,2) and
> - * exit_to_user_mode(). This function is preferred unless there is a
> - * compelling architectural reason to use the seperate functions.
>   */
> -void syscall_exit_to_user_mode(struct pt_regs *regs);
> +void kentry_syscall_end(struct pt_regs *regs);
>  
>  /**
>   * kentry_enter_from_user_mode - Establish state before invoking the irq handler
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 269766a8f981..800ad406431b 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -80,44 +80,19 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
>  	return ret ? : syscall;
>  }
>  
> -static __always_inline long
> -__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
> +long kentry_syscall_begin(struct pt_regs *regs, long syscall)
>  {
>  	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
>  
> +	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> +	lockdep_assert_irqs_enabled();
> +
>  	if (work & SYSCALL_WORK_ENTER)
>  		syscall = syscall_trace_enter(regs, syscall, work);
>  
>  	return syscall;
>  }
>  
> -long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
> -{
> -	return __syscall_enter_from_user_work(regs, syscall);
> -}
> -
> -noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
> -{
> -	long ret;
> -
> -	__enter_from_user_mode(regs);
> -
> -	instrumentation_begin();
> -	local_irq_enable();
> -	ret = __syscall_enter_from_user_work(regs, syscall);
> -	instrumentation_end();
> -
> -	return ret;
> -}
> -
> -noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
> -{
> -	__enter_from_user_mode(regs);
> -	instrumentation_begin();
> -	local_irq_enable();
> -	instrumentation_end();
> -}
> -
>  /* See comment for exit_to_user_mode() in entry-common.h */
>  static __always_inline void __exit_to_user_mode(void)
>  {
> @@ -218,7 +193,7 @@ static inline bool report_single_step(unsigned long work)
>  /*
>   * If SYSCALL_EMU is set, then the only reason to report is when
>   * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
> - * instruction has been already reported in syscall_enter_from_user_mode().
> + * instruction has already been reported in kentry_syscall_begin().
>   */
>  static inline bool report_single_step(unsigned long work)
>  {
> @@ -261,7 +236,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
>   * Syscall specific exit to user mode preparation. Runs with interrupts
>   * enabled.
>   */
> -static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> +void kentry_syscall_end(struct pt_regs *regs)
>  {
>  	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
>  	unsigned long nr = syscall_get_nr(current, regs);
> @@ -284,26 +259,6 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
>  		syscall_exit_work(regs, work);
>  }
>  
> -static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
> -{
> -	syscall_exit_to_user_mode_prepare(regs);
> -	local_irq_disable_exit_to_user();
> -	exit_to_user_mode_prepare(regs);
> -}
> -
> -void syscall_exit_to_user_mode_work(struct pt_regs *regs)
> -{
> -	__syscall_exit_to_user_mode_work(regs);
> -}
> -
> -__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
> -{
> -	instrumentation_begin();
> -	__syscall_exit_to_user_mode_work(regs);
> -	instrumentation_end();
> -	__exit_to_user_mode();
> -}
> -
>  noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
>  {
>  	__enter_from_user_mode(regs);
> -- 
> 2.29.2
> 


* Re: [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only
  2021-03-04 19:06 ` [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only Andy Lutomirski
@ 2021-03-08 10:06   ` Mark Rutland
  2021-03-14  1:18     ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Rutland @ 2021-03-08 10:06 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML

On Thu, Mar 04, 2021 at 11:06:00AM -0800, Andy Lutomirski wrote:
> exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
> It's used only by arm64, and arm64 should stop using it (hint, hint!).
> Compile it out on other architectures to minimize the chance of error.

For context, I do plan to move over, but there's a reasonable amount of
preparatory work needed first (e.g. factoring out the remaining asm
entry points, and reworking our pseudo-NMI management to play nicely
with the common entry code).

> enter_from_user_mode() is a legacy way to spell
> kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
> the same treatment.

I think you can remove these entirely, no ifdeffery necessary.

Currently arm64 cannot select CONFIG_GENERIC_ENTRY, so we open-code
copies of these in arch/arm64/kernel/entry-common.c and don't use
these common versions at all. When we move over to the common code we
can move directly to the kentry_* versions. If we are relying on the
prototypes anywhere, that's a bug in and of itself.

In retrospect I probably should have given our local copies an arm64_*
prefix. If I can't get rid of them soon I'll add that to lessen the
scope for confusion.

Mark.
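
A rough sketch of what that arm64_* prefix could look like (the body
approximates the open-coded arm64 copy and is illustrative, not
verbatim kernel source):

	/* arch/arm64/kernel/entry-common.c, hypothetical arm64_* spelling */
	static void noinstr arm64_enter_from_user_mode(void)
	{
		lockdep_hardirqs_off(CALLER_ADDR0);
		CT_WARN_ON(ct_state() != CONTEXT_USER);
		user_exit_irqoff();
		trace_hardirqs_off_finish();
	}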

> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/entry-common.h | 34 ++++++----------------------------
>  kernel/entry/common.c        |  4 ++++
>  2 files changed, 10 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 5287c6c15a66..a75374f87258 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -97,26 +97,15 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
>  }
>  #endif
>  
> +#ifdef CONFIG_ARM64
>  /**
>   * enter_from_user_mode - Establish state when coming from user mode
>   *
> - * Syscall/interrupt entry disables interrupts, but user mode is traced as
> - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
> + * Legacy variant of kentry_enter_from_user_mode().  Used only by arm64.
>   *
> - * 1) Tell lockdep that interrupts are disabled
> - * 2) Invoke context tracking if enabled to reactivate RCU
> - * 3) Trace interrupts off state
> - *
> - * Invoked from architecture specific syscall entry code with interrupts
> - * disabled. The calling code has to be non-instrumentable. When the
> - * function returns all state is correct and interrupts are still
> - * disabled. The subsequent functions can be instrumented.
> - *
> - * This is invoked when there is architecture specific functionality to be
> - * done between establishing state and enabling interrupts. The caller must
> - * enable interrupts before invoking syscall_enter_from_user_mode_work().
>   */
>  void enter_from_user_mode(struct pt_regs *regs);
> +#endif
>  
>  /**
>   * kentry_syscall_begin - Prepare to invoke a syscall handler
> @@ -261,25 +250,14 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
>  }
>  #endif
>  
> +#ifdef CONFIG_ARM64
>  /**
>   * exit_to_user_mode - Fixup state when exiting to user mode
>   *
> - * Syscall/interrupt exit enables interrupts, but the kernel state is
> - * interrupts disabled when this is invoked. Also tell RCU about it.
> - *
> - * 1) Trace interrupts on state
> - * 2) Invoke context tracking if enabled to adjust RCU state
> - * 3) Invoke architecture specific last minute exit code, e.g. speculation
> - *    mitigations, etc.: arch_exit_to_user_mode()
> - * 4) Tell lockdep that interrupts are enabled
> - *
> - * Invoked from architecture specific code when syscall_exit_to_user_mode()
> - * is not suitable as the last step before returning to userspace. Must be
> - * invoked with interrupts disabled and the caller must be
> - * non-instrumentable.
> - * The caller has to invoke syscall_exit_to_user_mode_work() before this.
> + * Does the latter part of kentry_exit_to_user_mode().  Only used by arm64.
>   */
>  void exit_to_user_mode(void);
> +#endif
>  
>  /**
>   * kentry_syscall_end - Finish syscall processing
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 800ad406431b..4ba82c684189 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -25,10 +25,12 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>  	instrumentation_end();
>  }
>  
> +#ifdef CONFIG_ARM64
>  void noinstr enter_from_user_mode(struct pt_regs *regs)
>  {
>  	__enter_from_user_mode(regs);
>  }
> +#endif
>  
>  static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
>  {
> @@ -106,10 +108,12 @@ static __always_inline void __exit_to_user_mode(void)
>  	lockdep_hardirqs_on(CALLER_ADDR0);
>  }
>  
> +#ifdef CONFIG_ARM64
>  void noinstr exit_to_user_mode(void)
>  {
>  	__exit_to_user_mode();
>  }
> +#endif
>  
>  /* Workaround to allow gradual conversion of architecture code */
>  void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
> -- 
> 2.29.2
> 


* Re: [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86
  2021-03-04 19:06 ` [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
@ 2021-03-08 10:13   ` Mark Rutland
  2021-03-29 11:50     ` Sven Schnelle
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Rutland @ 2021-03-08 10:13 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, LKML, Sven Schnelle

On Thu, Mar 04, 2021 at 11:06:01AM -0800, Andy Lutomirski wrote:
> In principle, the generic entry code is generic, and the goal is to use it
> in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
> to the generic config so that it can be used in the generic entry code and
> not just in arch/x86.
> 
> Disable it on arm64.  arm64 uses some but not all of the kentry
> code, and trying to debug the resulting state machine will be painful.
> arm64 can turn it back on when it starts using the entire generic
> path.

Can we make this depend on CONFIG_GENERIC_ENTRY instead of !ARM64?
That'd be more in line with "use the generic entry code, get the generic
functionality". Note that arm64 doesn't select CONFIG_GENERIC_ENTRY
today.
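
Concretely, that might read as follows (a sketch of the suggested
dependency, not a tested change):

	config DEBUG_ENTRY
		bool "Debug low-level entry code"
		depends on DEBUG_KERNEL
		depends on GENERIC_ENTRY
		help
		  This option enables sanity checks in the low-level entry code.

		  If unsure, say N.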

I see that s390 selects CONFIG_GENERIC_ENTRY, and either way this will
make DEBUG_ENTRY available for them, so it'd be worth checking whether
this is OK for them.

Sven, thoughts?

Thanks,
Mark.

> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/Kconfig.debug | 10 ----------
>  lib/Kconfig.debug      | 11 +++++++++++
>  2 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 80b57e7f4947..a5a52133730c 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -170,16 +170,6 @@ config CPA_DEBUG
>  	help
>  	  Do change_page_attr() self-tests every 30 seconds.
>  
> -config DEBUG_ENTRY
> -	bool "Debug low-level entry code"
> -	depends on DEBUG_KERNEL
> -	help
> -	  This option enables sanity checks in x86's low-level entry code.
> -	  Some of these sanity checks may slow down kernel entries and
> -	  exits or otherwise impact performance.
> -
> -	  If unsure, say N.
> -
>  config DEBUG_NMI_SELFTEST
>  	bool "NMI Selftest"
>  	depends on DEBUG_KERNEL && X86_LOCAL_APIC
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 7937265ef879..76549c8afa8a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
>  
>  endmenu # lock debugging
>  
> +config DEBUG_ENTRY
> +	bool "Debug low-level entry code"
> +	depends on DEBUG_KERNEL
> +	depends on !ARM64
> +	help
> +	  This option enables sanity checks in the low-level entry code.
> +	  Some of these sanity checks may slow down kernel entries and
> +	  exits or otherwise impact performance.
> +
> +	  If unsure, say N.
> +
>  config TRACE_IRQFLAGS
>  	depends on TRACE_IRQFLAGS_SUPPORT
>  	bool
> -- 
> 2.29.2
> 


* Re: [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only
  2021-03-08 10:06   ` Mark Rutland
@ 2021-03-14  1:18     ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2021-03-14  1:18 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Andy Lutomirski, X86 ML, LKML

On Mon, Mar 8, 2021 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Mar 04, 2021 at 11:06:00AM -0800, Andy Lutomirski wrote:
> > exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
> > It's used only by arm64, and arm64 should stop using it (hint, hint!).
> > Compile it out on other architectures to minimize the chance of error.
>
> For context, I do plan to move over, but there's a reasonable amount of
> preparatory work needed first (e.g. factoring out the remaining asm
> entry points, and reworking our pseudo-NMI management to play nicely
> with the common entry code).
>
> > enter_from_user_mode() is a legacy way to spell
> > kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
> > the same treatment.
>
> I think you can remove these entirely, no ifdeffery necessary.
>
> Currently arm64 cannot select CONFIG_GENERIC_ENTRY, so we open-code
> copies of these in arch/arm64/kernel/entry-common.c and don't use
> these common versions at all. When we move over to the common code we
> can move directly to the kentry_* versions. If we are relying on the
> prototypes anywhere, that's a bug in and of itself.

I got faked out!  Fixed for the next version.


* Re: [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86
  2021-03-08 10:13   ` Mark Rutland
@ 2021-03-29 11:50     ` Sven Schnelle
  0 siblings, 0 replies; 23+ messages in thread
From: Sven Schnelle @ 2021-03-29 11:50 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Andy Lutomirski, x86, LKML

Mark Rutland <mark.rutland@arm.com> writes:

> On Thu, Mar 04, 2021 at 11:06:01AM -0800, Andy Lutomirski wrote:
>> In principle, the generic entry code is generic, and the goal is to use it
>> in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
>> to the generic config so that it can be used in the generic entry code and
>> not just in arch/x86.
>> 
>> Disable it on arm64.  arm64 uses some but not all of the kentry
>> code, and trying to debug the resulting state machine will be painful.
>> arm64 can turn it back on when it starts using the entire generic
>> path.
>
> Can we make this depend on CONFIG_GENERIC_ENTRY instead of !ARM64?
> That'd be more in line with "use the generic entry code, get the generic
> functionality". Note that arm64 doesn't select CONFIG_GENERIC_ENTRY
> today.
>
> I see that s390 selects CONFIG_GENERIC_ENTRY, and either way this will
> make DEBUG_ENTRY available for them, so it'd be worth checking whether
> this is OK for them.
>
> Sven, thoughts?
>

For s390 that change should be fine.


Thread overview: 23+ messages
2021-03-04 19:05 [PATCH v3 00/11] kentry: A stable bugfix and a bunch of improvements Andy Lutomirski
2021-03-04 19:05 ` [PATCH v3 01/11] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls Andy Lutomirski
2021-03-05 10:16   ` [tip: x86/urgent] " tip-bot2 for Andy Lutomirski
2021-03-06 10:44   ` tip-bot2 for Andy Lutomirski
2021-03-06 12:18   ` tip-bot2 for Andy Lutomirski
2021-03-04 19:05 ` [PATCH v3 02/11] kentry: Rename irqentry to kentry Andy Lutomirski
2021-03-08  9:45   ` Mark Rutland
2021-03-04 19:05 ` [PATCH v3 03/11] x86/dumpstack: Remove unnecessary range check fetching opcode bytes Andy Lutomirski
2021-03-04 19:05 ` [PATCH v3 04/11] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads Andy Lutomirski
2021-03-04 20:19   ` Ira Weiny
2021-03-04 19:05 ` [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C Andy Lutomirski
2021-03-05  0:55   ` Brian Gerst
2021-03-04 19:05 ` [PATCH v3 06/11] kentry: Simplify the common syscall API Andy Lutomirski
2021-03-08  9:49   ` Mark Rutland
2021-03-04 19:06 ` [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only Andy Lutomirski
2021-03-08 10:06   ` Mark Rutland
2021-03-14  1:18     ` Andy Lutomirski
2021-03-04 19:06 ` [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86 Andy Lutomirski
2021-03-08 10:13   ` Mark Rutland
2021-03-29 11:50     ` Sven Schnelle
2021-03-04 19:06 ` [PATCH v3 09/11] kentry: Add debugging checks for proper kentry API usage Andy Lutomirski
2021-03-04 19:06 ` [PATCH v3 10/11] kentry: Check that syscall entries and syscall exits match Andy Lutomirski
2021-03-04 19:06 ` [PATCH v3 11/11] kentry: Verify kentry state in instrumentation_begin/end() Andy Lutomirski
