* [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

This is the 5th version of generic entry/exit functionality for host and
guest.

The 4th version is available here:

    https://lore.kernel.org/r/20200721105706.030914876@linutronix.de

Changes vs. V4:

  - Add the missing instrumentation preventions to the entry Makefile (Kees)
  - Rename exit_to_guest to xfer_to_guest (Sean)

The patches depend on:

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry

The lot is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/entry

Thanks,

	tglx




* [patch V5 01/15] seccomp: Provide stub for __secure_computing()
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

To avoid #ifdeffery in the upcoming generic syscall entry code, provide a
stub for __secure_computing(). It is preferred over secure_computing()
because the TIF flag has already been evaluated by the caller.
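
For illustration, the call site in the generic entry work (see patch 02)
looks like this; the stub keeps it compiling when CONFIG_SECCOMP is
disabled:

    if (ti_work & _TIF_SECCOMP) {
            ret = __secure_computing(NULL);
            if (ret == -1L)
                    return ret;
    }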

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>

---
V4: New patch
---
 include/linux/seccomp.h |    1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -61,6 +61,7 @@ struct seccomp_filter { };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 static inline int secure_computing(void) { return 0; }
+static inline int __secure_computing(void) { return 0; }
 #else
 static inline void secure_computing_strict(int this_syscall) { return; }
 #endif




* [patch V5 02/15] entry: Provide generic syscall entry functionality
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

On syscall entry certain work needs to be done:

   - Establish state (lockdep, context tracking, tracing)
   - Conditional work (ptrace, seccomp, audit...)

This code is needlessly duplicated and differs in all architectures.

Provide a generic version based on the x86 implementation which has all the
RCU and instrumentation bits right.

As interrupt/exception entry from user space needs parts of the same
functionality, provide a function for this as well.

syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
called right after the low level ASM entry. The calling code must be
non-instrumentable. After the functions return, the state is correct and
subsequent functions can be instrumented.
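
Roughly, the intended call pattern looks like the sketch below. The
function name arch_do_syscall and the 'ax' register are illustrative and
assume an x86-like architecture:

    __visible noinstr void arch_do_syscall(struct pt_regs *regs, long nr)
    {
            /* Must be called first, from non-instrumentable code */
            nr = syscall_enter_from_user_mode(regs, nr);

            instrumentation_begin();
            if (nr >= 0 && nr < NR_syscalls)
                    regs->ax = sys_call_table[nr](regs);
            instrumentation_end();
    }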

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
---
V5: Add the missing protections against sanitizers and other goodies

V4: Remove the architecture wrappers for seccomp and audit and use
    the generic code (Kees)

    Fix TIF_SYSCALL_AUDIT dummy define and remove TIF_SYSCALL_TRACE as it's
    defined in all architectures.

V3: Adapt to noinstr changes. Add interrupt/exception entry function.

V2: Fix function documentation (Mike)
    Add comment about return value (Andy)
---
 arch/Kconfig                 |    3 +
 include/linux/entry-common.h |  121 +++++++++++++++++++++++++++++++++++++++++++
 kernel/Makefile              |    1 
 kernel/entry/Makefile        |   12 ++++
 kernel/entry/common.c        |   88 +++++++++++++++++++++++++++++++
 5 files changed, 225 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -27,6 +27,9 @@ config HAVE_IMA_KEXEC
 config HOTPLUG_SMT
 	bool
 
+config GENERIC_ENTRY
+       bool
+
 config OPROFILE
 	tristate "OProfile system profiling"
 	depends on PROFILING
--- /dev/null
+++ b/include/linux/entry-common.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYCOMMON_H
+#define __LINUX_ENTRYCOMMON_H
+
+#include <linux/tracehook.h>
+#include <linux/syscalls.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+
+#include <asm/entry-common.h>
+
+/*
+ * Define dummy _TIF work flags if not defined by the architecture or for
+ * disabled functionality.
+ */
+#ifndef _TIF_SYSCALL_EMU
+# define _TIF_SYSCALL_EMU		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_TRACEPOINT
+# define _TIF_SYSCALL_TRACEPOINT	(0)
+#endif
+
+#ifndef _TIF_SECCOMP
+# define _TIF_SECCOMP			(0)
+#endif
+
+#ifndef _TIF_SYSCALL_AUDIT
+# define _TIF_SYSCALL_AUDIT		(0)
+#endif
+
+/*
+ * TIF flags handled in syscall_enter_from_user_mode()
+ */
+#ifndef ARCH_SYSCALL_ENTER_WORK
+# define ARCH_SYSCALL_ENTER_WORK	(0)
+#endif
+
+#define SYSCALL_ENTER_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP |	\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
+	 ARCH_SYSCALL_ENTER_WORK)
+
+/**
+ * arch_check_user_regs - Architecture specific sanity check for user mode regs
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Defaults to an empty implementation. Can be replaced by architecture
+ * specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode() in the non-instrumentable
+ * section. Use __always_inline so the compiler cannot push it out of line
+ * and make it instrumentable.
+ */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs);
+
+#ifndef arch_check_user_regs
+static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
+#endif
+
+/**
+ * arch_syscall_enter_tracehook - Wrapper around tracehook_report_syscall_entry()
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: 0 on success or an error code to skip the syscall.
+ *
+ * Defaults to tracehook_report_syscall_entry(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode()
+ */
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_tracehook
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
+{
+	return tracehook_report_syscall_entry(regs);
+}
+#endif
+
+/**
+ * syscall_enter_from_user_mode - Check and handle work before invoking
+ *				 a syscall
+ * @regs:	Pointer to current's pt_regs
+ * @syscall:	The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and the subsequent functions can be
+ * instrumented.
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * If the returned syscall number is -1 then the syscall should be
+ * skipped. In this case the caller may invoke syscall_set_error() or
+ * syscall_set_return_value() first.  If neither of those are called and -1
+ * is returned, then the syscall will fail with ENOSYS.
+ *
+ * The following functionality is handled here:
+ *
+ *  1) Establish state (lockdep, RCU (context tracking), tracing)
+ *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ *     __secure_computing(), trace_sys_enter()
+ *  3) Invocation of audit_syscall_entry()
+ */
+long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+
+/**
+ * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from architecture specific entry code with interrupts disabled.
+ * Can only be called when the interrupt entry came from user mode. The
+ * calling code must be non-instrumentable.  When the function returns all
+ * state is correct and the subsequent functions can be instrumented.
+ *
+ * The function establishes state (lockdep, RCU (context tracking), tracing)
+ */
+void irqentry_enter_from_user_mode(struct pt_regs *regs);
+
+#endif
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -48,6 +48,7 @@ obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
+obj-y += entry/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
--- /dev/null
+++ b/kernel/entry/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Prevent the noinstr section from being pestered by sanitizer and other goodies
+# as long as these things cannot be disabled per function.
+KASAN_SANITIZE := n
+UBSAN_SANITIZE := n
+KCOV_INSTRUMENT := n
+
+CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
+CFLAGS_common.o		+= -fno-stack-protector
+
+obj-$(CONFIG_GENERIC_ENTRY) += common.o
--- /dev/null
+++ b/kernel/entry/common.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/context_tracking.h>
+#include <linux/entry-common.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+{
+	arch_check_user_regs(regs);
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
+	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	user_exit_irqoff();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
+}
+
+static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
+{
+	if (unlikely(audit_context())) {
+		unsigned long args[6];
+
+		syscall_get_arguments(current, regs, args);
+		audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]);
+	}
+}
+
+static long syscall_trace_enter(struct pt_regs *regs, long syscall,
+				unsigned long ti_work)
+{
+	long ret = 0;
+
+	/* Handle ptrace */
+	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
+		ret = arch_syscall_enter_tracehook(regs);
+		if (ret || (ti_work & _TIF_SYSCALL_EMU))
+			return -1L;
+	}
+
+	/* Do seccomp after ptrace, to catch any tracer changes. */
+	if (ti_work & _TIF_SECCOMP) {
+		ret = __secure_computing(NULL);
+		if (ret == -1L)
+			return ret;
+	}
+
+	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
+		trace_sys_enter(regs, syscall);
+
+	syscall_enter_audit(regs, syscall);
+
+	return ret ? : syscall;
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+	unsigned long ti_work;
+
+	enter_from_user_mode(regs);
+	instrumentation_begin();
+
+	local_irq_enable();
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (ti_work & SYSCALL_ENTER_WORK)
+		syscall = syscall_trace_enter(regs, syscall, ti_work);
+	instrumentation_end();
+
+	return syscall;
+}
+
+noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+}



* [patch V5 03/15] entry: Provide generic syscall exit function
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Like syscall entry, all architectures have similar and pointlessly different
code to handle pending work before returning from a syscall to user space.

  1) One-time syscall exit work:
      - rseq syscall exit
      - audit
      - syscall tracing
      - tracehook (single stepping)

  2) Preparatory work
      - Exit to user mode loop (common TIF handling).
      - Architecture specific one time work arch_exit_to_user_mode_prepare()
      - Address limit and lockdep checks
     
  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
     arch_exit_to_user_mode() to handle e.g. speculation mitigations

Provide a generic version based on the x86 code which has all the RCU and
instrumentation protections right.

Provide a variant for interrupt return to user mode as well which shares
the above #2 and #3 work items.

After syscall_exit_to_user_mode() and irqentry_exit_to_user_mode() the
architecture code just has to return to user space. The code after
returning from these functions must not be instrumented.
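
A sketch of the resulting shape for an interrupt from user mode follows;
the handler name is hypothetical:

    noinstr void arch_irq_from_user(struct pt_regs *regs)
    {
            irqentry_enter_from_user_mode(regs);

            instrumentation_begin();
            handle_irq(regs);               /* hypothetical handler */
            instrumentation_end();

            /* Shares work items #2 and #3 with syscall_exit_to_user_mode() */
            irqentry_exit_to_user_mode(regs);
    }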

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
V4: Remove _TIF_NOTIFY_RESUME dummy define as it is defined by all architectures
    Remove the return value helper and use the existing regs_return_value()
---
 include/linux/entry-common.h |  189 +++++++++++++++++++++++++++++++++++++++++++
 kernel/entry/common.c        |  169 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 358 insertions(+)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -29,6 +29,14 @@
 # define _TIF_SYSCALL_AUDIT		(0)
 #endif
 
+#ifndef _TIF_PATCH_PENDING
+# define _TIF_PATCH_PENDING		(0)
+#endif
+
+#ifndef _TIF_UPROBE
+# define _TIF_UPROBE			(0)
+#endif
+
 /*
  * TIF flags handled in syscall_enter_from_user_mode()
  */
@@ -41,6 +49,29 @@
 	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
 	 ARCH_SYSCALL_ENTER_WORK)
 
+/*
+ * TIF flags handled in syscall_exit_to_user_mode()
+ */
+#ifndef ARCH_SYSCALL_EXIT_WORK
+# define ARCH_SYSCALL_EXIT_WORK		(0)
+#endif
+
+#define SYSCALL_EXIT_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |			\
+	 _TIF_SYSCALL_TRACEPOINT | ARCH_SYSCALL_EXIT_WORK)
+
+/*
+ * TIF flags handled in exit_to_user_mode_loop()
+ */
+#ifndef ARCH_EXIT_TO_USER_MODE_WORK
+# define ARCH_EXIT_TO_USER_MODE_WORK		(0)
+#endif
+
+#define EXIT_TO_USER_MODE_WORK						\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 ARCH_EXIT_TO_USER_MODE_WORK)
+
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
  * @regs:	Pointer to current's pt_regs
@@ -106,6 +137,149 @@ static inline __must_check int arch_sysc
 long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
 
 /**
+ * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Defaults to local_irq_enable(). Can be supplied by architecture specific
+ * code.
+ */
+static inline void local_irq_enable_exit_to_user(unsigned long ti_work);
+
+#ifndef local_irq_enable_exit_to_user
+static inline void local_irq_enable_exit_to_user(unsigned long ti_work)
+{
+	local_irq_enable();
+}
+#endif
+
+/**
+ * local_irq_disable_exit_to_user - Exit to user variant of local_irq_disable()
+ *
+ * Defaults to local_irq_disable(). Can be supplied by architecture specific
+ * code.
+ */
+static inline void local_irq_disable_exit_to_user(void);
+
+#ifndef local_irq_disable_exit_to_user
+static inline void local_irq_disable_exit_to_user(void)
+{
+	local_irq_disable();
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode_work - Architecture specific TIF work for exit
+ *				 to user mode.
+ * @regs:	Pointer to current's pt_regs
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Invoked from exit_to_user_mode_loop() with interrupts enabled
+ *
+ * Defaults to NOOP. Can be supplied by architecture specific code.
+ */
+static inline void arch_exit_to_user_mode_work(struct pt_regs *regs,
+					       unsigned long ti_work);
+
+#ifndef arch_exit_to_user_mode_work
+static inline void arch_exit_to_user_mode_work(struct pt_regs *regs,
+					       unsigned long ti_work)
+{
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode_prepare - Architecture specific preparation for
+ *				    exit to user mode.
+ * @regs:	Pointer to current's pt_regs
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Invoked from exit_to_user_mode_prepare() with interrupts disabled as the last
+ * function before return. Defaults to NOOP.
+ */
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work);
+
+#ifndef arch_exit_to_user_mode_prepare
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work)
+{
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode - Architecture specific final work before
+ *			    exit to user mode.
+ *
+ * Invoked from exit_to_user_mode() with interrupts disabled as the last
+ * function before return. Defaults to NOOP.
+ *
+ * This needs to be __always_inline because it is non-instrumentable code
+ * invoked after context tracking switched to user mode.
+ *
+ * An architecture implementation must not do anything complex, no locking
+ * etc. The main purpose is for speculation mitigations.
+ */
+static __always_inline void arch_exit_to_user_mode(void);
+
+#ifndef arch_exit_to_user_mode
+static __always_inline void arch_exit_to_user_mode(void) { }
+#endif
+
+/**
+ * arch_do_signal - Architecture specific signal delivery function
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from exit_to_user_mode_loop().
+ */
+void arch_do_signal(struct pt_regs *regs);
+
+/**
+ * arch_syscall_exit_tracehook - Wrapper around tracehook_report_syscall_exit()
+ * @regs:	Pointer to current's pt_regs
+ * @step:	Indicator for single step
+ *
+ * Defaults to tracehook_report_syscall_exit(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_exit_to_user_mode()
+ */
+static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step);
+
+#ifndef arch_syscall_exit_tracehook
+static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
+{
+	tracehook_report_syscall_exit(regs, step);
+}
+#endif
+
+/**
+ * syscall_exit_to_user_mode - Handle work before returning to user mode
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked with interrupts enabled and fully valid regs. Returns with all
+ * work handled, interrupts disabled such that the caller can immediately
+ * switch to user mode. Called from architecture specific syscall and ret
+ * from fork code.
+ *
+ * The call order is:
+ *  1) One-time syscall exit work:
+ *	- rseq syscall exit
+ *      - audit
+ *	- syscall tracing
+ *	- tracehook (single stepping)
+ *
+ *  2) Preparatory work
+ *	- Exit to user mode loop (common TIF handling). Invokes
+ *	  arch_exit_to_user_mode_work() for architecture specific TIF work
+ *	- Architecture specific one time work arch_exit_to_user_mode_prepare()
+ *	- Address limit and lockdep checks
+ *
+ *  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
+ *     arch_exit_to_user_mode() to handle e.g. speculation mitigations
+ */
+void syscall_exit_to_user_mode(struct pt_regs *regs);
+
+/**
  * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
  * @regs:	Pointer to current's pt_regs
  *
@@ -118,4 +292,19 @@ long syscall_enter_from_user_mode(struct
  */
 void irqentry_enter_from_user_mode(struct pt_regs *regs);
 
+/**
+ * irqentry_exit_to_user_mode - Interrupt exit work
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked with interrupts disabled and fully valid regs. Returns with all
+ * work handled, interrupts disabled such that the caller can immediately
+ * switch to user mode. Called from architecture specific interrupt
+ * handling code.
+ *
+ * The call order is #2 and #3 as described in syscall_exit_to_user_mode().
+ * Interrupt exit is not invoking #1 which is the syscall specific one time
+ * work.
+ */
+void irqentry_exit_to_user_mode(struct pt_regs *regs);
+
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,8 @@
 
 #include <linux/context_tracking.h>
 #include <linux/entry-common.h>
+#include <linux/livepatch.h>
+#include <linux/audit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -82,7 +84,174 @@ noinstr long syscall_enter_from_user_mod
 	return syscall;
 }
 
+/**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall/interrupt exit enables interrupts, but the kernel state is
+ * interrupts disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Invoke architecture specific last minute exit code, e.g. speculation
+ *    mitigations, etc.
+ * 4) Tell lockdep that interrupts are enabled
+ */
+static __always_inline void exit_to_user_mode(void)
+{
+	instrumentation_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instrumentation_end();
+
+	user_enter_irqoff();
+	arch_exit_to_user_mode();
+	lockdep_hardirqs_on(CALLER_ADDR0);
+}
+
+/* Workaround to allow gradual conversion of architecture code */
+void __weak arch_do_signal(struct pt_regs *regs) { }
+
+static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+					    unsigned long ti_work)
+{
+	/*
+	 * Before returning to user space ensure that all pending work
+	 * items have been completed.
+	 */
+	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+
+		local_irq_enable_exit_to_user(ti_work);
+
+		if (ti_work & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (ti_work & _TIF_UPROBE)
+			uprobe_notify_resume(regs);
+
+		if (ti_work & _TIF_PATCH_PENDING)
+			klp_update_patch_state(current);
+
+		if (ti_work & _TIF_SIGPENDING)
+			arch_do_signal(regs);
+
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+			rseq_handle_notify_resume(NULL, regs);
+		}
+
+		/* Architecture specific TIF work */
+		arch_exit_to_user_mode_work(regs, ti_work);
+
+		/*
+		 * Disable interrupts and reevaluate the work flags as they
+	 * might have changed while interrupts and preemption were
+		 * enabled above.
+		 */
+		local_irq_disable_exit_to_user();
+		ti_work = READ_ONCE(current_thread_info()->flags);
+	}
+
+	/* Return the latest work state for arch_exit_to_user_mode() */
+	return ti_work;
+}
+
+static void exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	lockdep_assert_irqs_disabled();
+
+	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+		ti_work = exit_to_user_mode_loop(regs, ti_work);
+
+	arch_exit_to_user_mode_prepare(regs, ti_work);
+
+	/* Ensure that the address limit is intact and no locks are held */
+	addr_limit_user_check();
+	lockdep_assert_irqs_disabled();
+	lockdep_sys_exit();
+}
+
+#ifndef _TIF_SINGLESTEP
+static inline bool report_single_step(unsigned long ti_work)
+{
+	return false;
+}
+#else
+/*
+ * If TIF_SYSCALL_EMU is set, then the only reason to report is when
+ * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
+ * instruction has already been reported in syscall_enter_from_user_mode().
+ */
+#define SYSEMU_STEP	(_TIF_SINGLESTEP | _TIF_SYSCALL_EMU)
+
+static inline bool report_single_step(unsigned long ti_work)
+{
+	return (ti_work & SYSEMU_STEP) == _TIF_SINGLESTEP;
+}
+#endif
+
+static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work)
+{
+	bool step;
+
+	audit_syscall_exit(regs);
+
+	if (ti_work & _TIF_SYSCALL_TRACEPOINT)
+		trace_sys_exit(regs, syscall_get_return_value(current, regs));
+
+	step = report_single_step(ti_work);
+	if (step || ti_work & _TIF_SYSCALL_TRACE)
+		arch_syscall_exit_tracehook(regs, step);
+}
+
+/*
+ * Syscall specific exit to user mode preparation. Runs with interrupts
+ * enabled.
+ */
+static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	u32 cached_flags = READ_ONCE(current_thread_info()->flags);
+	unsigned long nr = syscall_get_nr(current, regs);
+
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
+	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
+		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
+			local_irq_enable();
+	}
+
+	rseq_syscall(regs);
+
+	/*
+	 * Do one-time syscall specific work. If these work items are
+	 * enabled, we want to run them exactly once per syscall exit with
+	 * interrupts enabled.
+	 */
+	if (unlikely(cached_flags & SYSCALL_EXIT_WORK))
+		syscall_exit_work(regs, cached_flags);
+}
+
+__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	syscall_exit_to_user_mode_prepare(regs);
+	local_irq_disable_exit_to_user();
+	exit_to_user_mode_prepare(regs);
+	instrumentation_end();
+	exit_to_user_mode();
+}
+
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	enter_from_user_mode(regs);
 }
+
+noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	exit_to_user_mode_prepare(regs);
+	instrumentation_end();
+	exit_to_user_mode();
+}




* [patch V5 04/15] entry: Provide generic interrupt entry/exit code
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Like the syscall entry/exit code, interrupt/exception entry after the real
low level ASM bits should not differ across architectures.

Provide a generic version based on the x86 code.

irqentry_enter() is called after the low level entry code and
irqentry_exit() must be invoked right before returning to the low level
code which just contains the actual return logic. The code before
irqentry_enter() and irqentry_exit() must not be instrumented. Code after
irqentry_enter() and before irqentry_exit() can be instrumented.

irqentry_enter() invokes irqentry_enter_from_user_mode() if the
interrupt/exception came from user mode. If it entered from kernel mode, it
handles the kernel mode variant of establishing state for lockdep, RCU and
tracing, depending on the kernel context it interrupted (idle, non-idle).
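
A sketch of hypothetical architecture code using the pair (the exception
handler name is illustrative):

    noinstr void arch_handle_exception(struct pt_regs *regs)
    {
            irqentry_state_t state = irqentry_enter(regs);

            instrumentation_begin();
            do_exception(regs);             /* hypothetical handler */
            instrumentation_end();

            /* Must be passed the state returned by irqentry_enter() */
            irqentry_exit(regs, state);
    }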

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/entry-common.h |   62 ++++++++++++++++++++++
 kernel/entry/common.c        |  117 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -307,4 +307,66 @@ void irqentry_enter_from_user_mode(struc
  */
 void irqentry_exit_to_user_mode(struct pt_regs *regs);
 
+#ifndef irqentry_state
+typedef struct irqentry_state {
+	bool	exit_rcu;
+} irqentry_state_t;
+#endif
+
+/**
+ * irqentry_enter - Handle state tracking on ordinary interrupt entries
+ * @regs:	Pointer to pt_regs of interrupted context
+ *
+ * Invokes:
+ *  - lockdep irqflag state tracking as low level ASM entry disabled
+ *    interrupts.
+ *
+ *  - Context tracking if the exception hit user mode.
+ *
+ *  - The hardirq tracer to keep the state consistent as low level ASM
+ *    entry disabled interrupts.
+ *
+ * As a precondition, this requires that the entry came from user mode,
+ * idle, or a kernel context in which RCU is watching.
+ *
+ * For kernel mode entries RCU handling is done conditionally. If RCU is
+ * watching then the only RCU requirement is to check whether the tick has
+ * to be restarted. If RCU is not watching then rcu_irq_enter() has to be
+ * invoked on entry and rcu_irq_exit() on exit.
+ *
+ * Avoiding the rcu_irq_enter/exit() calls is an optimization, but it also
+ * solves the problem of kernel mode pagefaults which can schedule; that
+ * is not possible after invoking rcu_irq_enter() without undoing it.
+ *
+ * For user mode entries irqentry_enter_from_user_mode() is invoked to
+ * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
+ * would not be possible.
+ *
+ * Returns: An opaque object that must be passed to irqentry_exit()
+ */
+irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ *
+ * Conditional reschedule with additional sanity checks.
+ */
+void irqentry_exit_cond_resched(void);
+
+/**
+ * irqentry_exit - Handle return from exception that used irqentry_enter()
+ * @regs:	Pointer to pt_regs (exception entry regs)
+ * @state:	Return value from matching call to irqentry_enter()
+ *
+ * Depending on the return target (kernel/user) this runs the necessary
+ * preemption and work checks if possible and required and returns to
+ * the caller with interrupts disabled and no further work pending.
+ *
+ * This is the last action before returning to the low level ASM code which
+ * just needs to return to the appropriate context.
+ *
+ * Counterpart to irqentry_enter().
+ */
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -255,3 +255,120 @@ noinstr void irqentry_exit_to_user_mode(
 	instrumentation_end();
 	exit_to_user_mode();
 }
+
+irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs)
+{
+	irqentry_state_t ret = {
+		.exit_rcu = false,
+	};
+
+	if (user_mode(regs)) {
+		irqentry_enter_from_user_mode(regs);
+		return ret;
+	}
+
+	/*
+	 * If this entry hit the idle task invoke rcu_irq_enter() whether
+	 * RCU is watching or not.
+	 *
+	 * Interrupts can nest when the first interrupt invokes softirq
+	 * processing on return which enables interrupts.
+	 *
+	 * Scheduler ticks in the idle task can mark quiescent state and
+	 * terminate a grace period, if and only if the timer interrupt is
+	 * not nested into another interrupt.
+	 *
+	 * Checking for rcu_is_watching() here would prevent the nesting
+	 * interrupt from invoking rcu_irq_enter(). If that nested interrupt is
+	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
+	 * assume that it is the first interrupt and eventually claim
+	 * quiescent state and end grace periods prematurely.
+	 *
+	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
+	 * consistent.
+	 *
+	 * TINY_RCU does not support EQS, so let the compiler eliminate
+	 * this part when enabled.
+	 */
+	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
+		/*
+		 * If RCU is not watching then the same careful
+		 * sequence vs. lockdep and tracing is required
+		 * as in irqentry_enter_from_user_mode().
+		 */
+		lockdep_hardirqs_off(CALLER_ADDR0);
+		rcu_irq_enter();
+		instrumentation_begin();
+		trace_hardirqs_off_finish();
+		instrumentation_end();
+
+		ret.exit_rcu = true;
+		return ret;
+	}
+
+	/*
+	 * If RCU is watching then RCU only wants to check whether it needs
+	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
+	 * already contains a warning when RCU is not watching, so no point
+	 * in having another one here.
+	 */
+	instrumentation_begin();
+	rcu_irq_enter_check_tick();
+	/* Use the combo lockdep/tracing function */
+	trace_hardirqs_off();
+	instrumentation_end();
+
+	return ret;
+}
+
+void irqentry_exit_cond_resched(void)
+{
+	if (!preempt_count()) {
+		/* Sanity check RCU and thread stack */
+		rcu_irq_exit_check_preempt();
+		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
+			WARN_ON_ONCE(!on_thread_stack());
+		if (need_resched())
+			preempt_schedule_irq();
+	}
+}
+
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+{
+	lockdep_assert_irqs_disabled();
+
+	/* Check whether this returns to user mode */
+	if (user_mode(regs)) {
+		irqentry_exit_to_user_mode(regs);
+	} else if (!regs_irqs_disabled(regs)) {
+		/*
+		 * If RCU was not watching on entry this needs to be done
+		 * carefully and needs the same ordering of lockdep/tracing
+		 * and RCU as the return to user mode path.
+		 */
+		if (state.exit_rcu) {
+			instrumentation_begin();
+			/* Tell the tracer that IRET will enable interrupts */
+			trace_hardirqs_on_prepare();
+			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+			instrumentation_end();
+			rcu_irq_exit();
+			lockdep_hardirqs_on(CALLER_ADDR0);
+			return;
+		}
+
+		instrumentation_begin();
+		if (IS_ENABLED(CONFIG_PREEMPTION))
+			irqentry_exit_cond_resched();
+		/* Covers both tracing and lockdep */
+		trace_hardirqs_on();
+		instrumentation_end();
+	} else {
+		/*
+		 * IRQ flags state is correct already. Just tell RCU if it
+		 * was not watching on entry.
+		 */
+		if (state.exit_rcu)
+			rcu_irq_exit();
+	}
+}




* [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode
From: Thomas Gleixner @ 2020-07-22 21:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Entering a guest is similar to exiting to user space. Pending work like
handling signals, rescheduling, task work etc. needs to be handled before
that.

Provide generic infrastructure to avoid duplication of the same handling
code all over the place.

The transfer to guest mode handling is different from the exit to user mode
handling, e.g. with respect to rseq and live patching, so a separate
function is used.

The initial list of work items handled is:

    TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME

Architecture specific TIF flags can be added via defines in the
architecture specific include files.

The calling convention is also different from the syscall/interrupt entry
functions as KVM invokes this from the outer vcpu_run() loop with
interrupts and preemption enabled. To prevent missing a pending work item
it invokes a check for pending TIF work from interrupt disabled code right
before transitioning to guest mode. The lockdep, RCU and tracing state
handling is also done directly around the switch to and from guest mode.
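
A simplified sketch of the usage from KVM's side; the 'redo' label stands
in for the actual loop structure of the KVM conversion later in this
series:

    /* Outer vcpu_run() loop: interrupts and preemption enabled */
    r = xfer_to_guest_mode(vcpu);
    if (r)
            return r;               /* e.g. -EINTR for a pending signal */

    local_irq_disable();
    /* Final check right before the switch to guest mode */
    if (xfer_to_guest_mode_work_pending()) {
            local_irq_enable();
            goto redo;
    }
    /* ... switch to guest mode ... */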

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Rename exit -> xfer (Sean)

V3: Reworked and simplified version adopted to recent X86 and KVM changes
    
V2: Moved KVM specific functions to kvm (Paolo)
    Added lockdep assert (Andy)
    Dropped live patching from enter guest mode work (Miroslav)
---
 include/linux/entry-kvm.h |   80 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h  |    8 ++++
 kernel/entry/Makefile     |    3 +
 kernel/entry/kvm.c        |   51 +++++++++++++++++++++++++++++
 virt/kvm/Kconfig          |    3 +
 5 files changed, 144 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/include/linux/entry-kvm.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYKVM_H
+#define __LINUX_ENTRYKVM_H
+
+#include <linux/entry-common.h>
+
+/* Transfer to guest mode work */
+#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
+
+#ifndef ARCH_XFER_TO_GUEST_MODE_WORK
+# define ARCH_XFER_TO_GUEST_MODE_WORK	(0)
+#endif
+
+#define XFER_TO_GUEST_MODE_WORK					\
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |			\
+	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+
+struct kvm_vcpu;
+
+/**
+ * arch_xfer_to_guest_mode_work - Architecture specific xfer to guest mode
+ *				  work function.
+ * @vcpu:	Pointer to current's VCPU data
+ * @ti_work:	Cached TIF flags gathered in xfer_to_guest_mode()
+ *
+ * Invoked from xfer_to_guest_mode_work(). Defaults to NOOP. Can be
+ * replaced by architecture specific code.
+ */
+static inline int arch_xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
+					      unsigned long ti_work);
+
+#ifndef arch_xfer_to_guest_mode_work
+static inline int arch_xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
+					       unsigned long ti_work)
+{
+	return 0;
+}
+#endif
+
+/**
+ * xfer_to_guest_mode - Check and handle pending work which needs to be
+ *			handled before returning to guest mode
+ * @vcpu:	Pointer to current's VCPU data
+ *
+ * Returns: 0 or an error code
+ */
+int xfer_to_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * __xfer_to_guest_mode_work_pending - Check if work is pending
+ *
+ * Returns: True if work pending, False otherwise.
+ *
+ * Bare variant of xfer_to_guest_mode_work_pending(). Can be called, with
+ * care, from interrupt enabled code for racy quick checks.
+ */
+static inline bool __xfer_to_guest_mode_work_pending(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	return !!(ti_work & XFER_TO_GUEST_MODE_WORK);
+}
+
+/**
+ * xfer_to_guest_mode_work_pending - Check if work is pending which needs to be
+ *				     handled before returning to guest mode
+ *
+ * Returns: True if work pending, False otherwise.
+ *
+ * Has to be invoked with interrupts disabled before the transition to
+ * guest mode.
+ */
+static inline bool xfer_to_guest_mode_work_pending(void)
+{
+	lockdep_assert_irqs_disabled();
+	return __xfer_to_guest_mode_work_pending();
+}
+#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
+#endif
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1439,4 +1439,12 @@ int kvm_vm_create_worker_thread(struct k
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
+static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_INTR;
+	vcpu->stat.signal_exits++;
+}
+#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
 #endif
--- a/kernel/entry/Makefile
+++ b/kernel/entry/Makefile
@@ -9,4 +9,5 @@ KCOV_INSTRUMENT := n
 CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
 CFLAGS_common.o		+= -fno-stack-protector
 
-obj-$(CONFIG_GENERIC_ENTRY) += common.o
+obj-$(CONFIG_GENERIC_ENTRY) 		+= common.o
+obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK)	+= kvm.o
--- /dev/null
+++ b/kernel/entry/kvm.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/entry-kvm.h>
+#include <linux/kvm_host.h>
+
+static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
+{
+	do {
+		int ret;
+
+		if (ti_work & _TIF_SIGPENDING) {
+			kvm_handle_signal_exit(vcpu);
+			return -EINTR;
+		}
+
+		if (ti_work & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(NULL);
+		}
+
+		ret = arch_xfer_to_guest_mode_work(vcpu, ti_work);
+		if (ret)
+			return ret;
+
+		ti_work = READ_ONCE(current_thread_info()->flags);
+	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+	return 0;
+}
+
+int xfer_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+	unsigned long ti_work;
+
+	/*
+	 * This is invoked from the outer guest loop with interrupts and
+	 * preemption enabled.
+	 *
+	 * KVM invokes xfer_to_guest_mode_work_pending() with interrupts
+	 * disabled in the inner loop before going into guest mode. No need
+	 * to disable interrupts here.
+	 */
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (!(ti_work & XFER_TO_GUEST_MODE_WORK))
+		return 0;
+
+	return xfer_to_guest_mode_work(vcpu, ti_work);
+}
+EXPORT_SYMBOL_GPL(xfer_to_guest_mode);
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -60,3 +60,6 @@ config HAVE_KVM_VCPU_RUN_PID_CHANGE
 
 config HAVE_KVM_NO_POLL
        bool
+
+config KVM_XFER_TO_GUEST_WORK
+       bool



* [patch V5 06/15] x86/entry: Consolidate check_user_regs()
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

The user register sanity check is sprinkled all over the place. Move it
into enter_from_user_mode().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
 arch/x86/entry/common.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -82,10 +82,11 @@ static noinstr void check_user_regs(stru
  * 2) Invoke context tracking if enabled to reactivate RCU
  * 3) Trace interrupts off state
  */
-static noinstr void enter_from_user_mode(void)
+static noinstr void enter_from_user_mode(struct pt_regs *regs)
 {
 	enum ctx_state state = ct_state();
 
+	check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	user_exit_irqoff();
 
@@ -95,8 +96,9 @@ static noinstr void enter_from_user_mode
 	instrumentation_end();
 }
 #else
-static __always_inline void enter_from_user_mode(void)
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 {
+	check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
@@ -369,9 +371,7 @@ static void __syscall_return_slowpath(st
 {
 	struct thread_info *ti;
 
-	check_user_regs(regs);
-
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -434,9 +434,7 @@ static void do_syscall_32_irqs_on(struct
 /* Handles int $0x80 */
 __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
-	check_user_regs(regs);
-
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -487,8 +485,6 @@ static bool __do_fast_syscall_32(struct
 					vdso_image_32.sym_int80_landing_pad;
 	bool success;
 
-	check_user_regs(regs);
-
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
 	 * so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
@@ -496,7 +492,7 @@ static bool __do_fast_syscall_32(struct
 	 */
 	regs->ip = landing_pad;
 
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -599,8 +595,7 @@ idtentry_state_t noinstr idtentry_enter(
 	};
 
 	if (user_mode(regs)) {
-		check_user_regs(regs);
-		enter_from_user_mode();
+		enter_from_user_mode(regs);
 		return ret;
 	}
 
@@ -733,8 +728,7 @@ void noinstr idtentry_exit(struct pt_reg
  */
 void noinstr idtentry_enter_user(struct pt_regs *regs)
 {
-	check_user_regs(regs);
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 }
 
 /**





* [patch V5 07/15] x86/entry: Consolidate 32/64 bit syscall entry
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

64bit and 32bit entry code have the same open coded syscall entry handling
after the bitwidth specific bits.

Move it to a helper function and share the code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/entry/common.c |   93 +++++++++++++++++++++---------------------------
 1 file changed, 41 insertions(+), 52 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -366,8 +366,7 @@ static void __syscall_return_slowpath(st
 	exit_to_user_mode();
 }
 
-#ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+static noinstr long syscall_enter(struct pt_regs *regs, unsigned long nr)
 {
 	struct thread_info *ti;
 
@@ -379,6 +378,16 @@ static void __syscall_return_slowpath(st
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
 		nr = syscall_trace_enter(regs);
 
+	instrumentation_end();
+	return nr;
+}
+
+#ifdef CONFIG_X86_64
+__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+{
+	nr = syscall_enter(regs, nr);
+
+	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
 		regs->ax = sys_call_table[nr](regs);
@@ -390,64 +399,53 @@ static void __syscall_return_slowpath(st
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
-	__syscall_return_slowpath(regs);
-
 	instrumentation_end();
-	exit_to_user_mode();
+	syscall_return_slowpath(regs);
 }
 #endif
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_IA32_EMULATION))
+		current_thread_info()->status |= TS_COMPAT;
+	/*
+	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
+	 * orig_ax, the unsigned int return value truncates it.  This may
+	 * or may not be necessary, but it matches the old asm behavior.
+	 */
+	return syscall_enter(regs, (unsigned int)regs->orig_ax);
+}
+
 /*
- * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
- * all entry and exit work and returns with IRQs off.  This function is
- * extremely hot in workloads that use it, and it's usually called from
- * do_fast_syscall_32, so forcibly inline it to improve performance.
+ * Invoke a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.
  */
-static void do_syscall_32_irqs_on(struct pt_regs *regs)
+static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
+						  unsigned int nr)
 {
-	struct thread_info *ti = current_thread_info();
-	unsigned int nr = (unsigned int)regs->orig_ax;
-
-#ifdef CONFIG_IA32_EMULATION
-	ti->status |= TS_COMPAT;
-#endif
-
-	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
-		/*
-		 * Subtlety here: if ptrace pokes something larger than
-		 * 2^32-1 into orig_ax, this truncates it.  This may or
-		 * may not be necessary, but it matches the old asm
-		 * behavior.
-		 */
-		nr = syscall_trace_enter(regs);
-	}
-
 	if (likely(nr < IA32_NR_syscalls)) {
+		instrumentation_begin();
 		nr = array_index_nospec(nr, IA32_NR_syscalls);
 		regs->ax = ia32_sys_call_table[nr](regs);
+		instrumentation_end();
 	}
-
-	__syscall_return_slowpath(regs);
 }
 
 /* Handles int $0x80 */
 __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-
-	local_irq_enable();
-	do_syscall_32_irqs_on(regs);
+	unsigned int nr = syscall_32_enter(regs);
 
-	instrumentation_end();
-	exit_to_user_mode();
+	do_syscall_32_irqs_on(regs, nr);
+	syscall_return_slowpath(regs);
 }
 
-static bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
+	unsigned int nr	= syscall_32_enter(regs);
 	int res;
 
+	instrumentation_begin();
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*
@@ -460,17 +458,18 @@ static bool __do_fast_syscall_32(struct
 		res = get_user(*(u32 *)&regs->bp,
 		       (u32 __user __force *)(unsigned long)(u32)regs->sp);
 	}
+	instrumentation_end();
 
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-		local_irq_disable();
-		__prepare_exit_to_usermode(regs);
+		syscall_return_slowpath(regs);
 		return false;
 	}
 
 	/* Now this is just like a normal syscall. */
-	do_syscall_32_irqs_on(regs);
+	do_syscall_32_irqs_on(regs, nr);
+	syscall_return_slowpath(regs);
 	return true;
 }
 
@@ -483,7 +482,6 @@ static bool __do_fast_syscall_32(struct
 	 */
 	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
 					vdso_image_32.sym_int80_landing_pad;
-	bool success;
 
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
@@ -492,17 +490,8 @@ static bool __do_fast_syscall_32(struct
 	 */
 	regs->ip = landing_pad;
 
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-
-	local_irq_enable();
-	success = __do_fast_syscall_32(regs);
-
-	instrumentation_end();
-	exit_to_user_mode();
-
-	/* If it failed, keep it simple: use IRET. */
-	if (!success)
+	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
+	if (!__do_fast_syscall_32(regs))
 		return 0;
 
 #ifdef CONFIG_X86_64





* [patch V5 08/15] x86/entry: Move user return notifier out of loop
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Guests and user space share certain MSRs. KVM sets these MSRs to guest
values once and does not set them back to user space values on every VM
exit to spare the costly MSR operations.

User return notifiers ensure that these MSRs are set back to the correct
values before returning to user space in exit_to_usermode_loop().

There is no reason to evaluate the TIF flag indicating that user return
notifiers need to be invoked in the loop. The important point is that they
are invoked before returning to user space.

Move the invocation out of the loop into the section which does the last
preparatory steps before returning to user space. That section is not
preemptible and runs with interrupts disabled until the actual return.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V4: New patch
---
 arch/x86/entry/common.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -208,7 +208,7 @@ static long syscall_trace_enter(struct p
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -242,9 +242,6 @@ static void exit_to_usermode_loop(struct
 			rseq_handle_notify_resume(NULL, regs);
 		}
 
-		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
-			fire_user_return_notifiers();
-
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
@@ -273,6 +270,9 @@ static void __prepare_exit_to_usermode(s
 	/* Reload ti->flags; we may have rescheduled above. */
 	cached_flags = READ_ONCE(ti->flags);
 
+	if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
 	if (unlikely(cached_flags & _TIF_IO_BITMAP))
 		tss_update_io_bitmap();
 




* [patch V5 09/15] x86/ptrace: Provide pt_regs helper for entry/exit
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

As a preparatory step for moving the syscall and interrupt entry/exit
handling into generic code, provide a pt_regs helper which retrieves the
interrupt state from pt_regs. This is required to check whether interrupts
are reenabled by return from interrupt/exception.
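
For reference, the generic irqentry_exit() introduced earlier in this
series uses the helper like this (excerpt):

    if (user_mode(regs)) {
            irqentry_exit_to_user_mode(regs);
    } else if (!regs_irqs_disabled(regs)) {
            /*
             * The interrupted kernel context had interrupts enabled,
             * i.e. returning from the exception reenables them.
             */
    }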

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
V4: Remove the syscall nr and return value helpers as they already exist
---
 arch/x86/include/asm/ptrace.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -209,6 +209,11 @@ static inline void user_stack_pointer_se
 	regs->sp = val;
 }
 
+static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
+{
+	return !(regs->flags & X86_EFLAGS_IF);
+}
+
 /* Query offset/name of register from its name/offset */
 extern int regs_query_register_offset(const char *name);
 extern const char *regs_query_register_name(unsigned int offset);




* [patch V5 10/15] x86/entry: Use generic syscall entry function
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Replace the syscall entry work handling with the generic version. Provide
the necessary helper inlines to handle the real architecture specific
parts, e.g. ptrace.

Use a temporary define for idtentry_enter_user which will be cleaned up
separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
V4: Drop the seccomp/audit wrappers as they are no longer required (Kees)

V3: Adapt to the noinstr changes
---
 arch/x86/Kconfig                    |    1 
 arch/x86/entry/common.c             |  181 +-----------------------------------
 arch/x86/include/asm/entry-common.h |   32 ++++++
 arch/x86/include/asm/idtentry.h     |    5 
 arch/x86/include/asm/thread_info.h  |    5 
 5 files changed, 45 insertions(+), 179 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -115,6 +115,7 @@ config X86
 	select GENERIC_CPU_AUTOPROBE
 	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_EARLY_IOREMAP
+	select GENERIC_ENTRY
 	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_IOMAP
 	select GENERIC_IRQ_EFFECTIVE_AFF_MASK	if SMP
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -10,13 +10,13 @@
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
+#include <linux/entry-common.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/errno.h>
 #include <linux/ptrace.h>
 #include <linux/tracehook.h>
 #include <linux/audit.h>
-#include <linux/seccomp.h>
 #include <linux/signal.h>
 #include <linux/export.h>
 #include <linux/context_tracking.h>
@@ -42,70 +42,8 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
-#define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-/* Check that the stack and regs on entry from user mode are sane. */
-static noinstr void check_user_regs(struct pt_regs *regs)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
-		/*
-		 * Make sure that the entry code gave us a sensible EFLAGS
-		 * register.  Native because we want to check the actual CPU
-		 * state, not the interrupt state as imagined by Xen.
-		 */
-		unsigned long flags = native_save_fl();
-		WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
-				      X86_EFLAGS_NT));
-
-		/* We think we came from user mode. Make sure pt_regs agrees. */
-		WARN_ON_ONCE(!user_mode(regs));
-
-		/*
-		 * All entries from user mode (except #DF) should be on the
-		 * normal thread stack and should have user pt_regs in the
-		 * correct location.
-		 */
-		WARN_ON_ONCE(!on_thread_stack());
-		WARN_ON_ONCE(regs != task_pt_regs(current));
-	}
-}
-
-#ifdef CONFIG_CONTEXT_TRACKING
-/**
- * enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall entry disables interrupts, but user mode is traced as interrupts
- * enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- */
-static noinstr void enter_from_user_mode(struct pt_regs *regs)
-{
-	enum ctx_state state = ct_state();
-
-	check_user_regs(regs);
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	user_exit_irqoff();
-
-	instrumentation_begin();
-	CT_WARN_ON(state != CONTEXT_USER);
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-}
-#else
-static __always_inline void enter_from_user_mode(struct pt_regs *regs)
-{
-	check_user_regs(regs);
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-}
-#endif
-
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -129,83 +67,6 @@ static __always_inline void exit_to_user
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
-	if (arch == AUDIT_ARCH_X86_64) {
-		audit_syscall_entry(regs->orig_ax, regs->di,
-				    regs->si, regs->dx, regs->r10);
-	} else
-#endif
-	{
-		audit_syscall_entry(regs->orig_ax, regs->bx,
-				    regs->cx, regs->dx, regs->si);
-	}
-}
-
-/*
- * Returns the syscall nr to run (which should match regs->orig_ax) or -1
- * to skip the syscall.
- */
-static long syscall_trace_enter(struct pt_regs *regs)
-{
-	u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
-
-	struct thread_info *ti = current_thread_info();
-	unsigned long ret = 0;
-	u32 work;
-
-	work = READ_ONCE(ti->flags);
-
-	if (work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
-		ret = tracehook_report_syscall_entry(regs);
-		if (ret || (work & _TIF_SYSCALL_EMU))
-			return -1L;
-	}
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Do seccomp after ptrace, to catch any tracer changes.
-	 */
-	if (work & _TIF_SECCOMP) {
-		struct seccomp_data sd;
-
-		sd.arch = arch;
-		sd.nr = regs->orig_ax;
-		sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
-		if (arch == AUDIT_ARCH_X86_64) {
-			sd.args[0] = regs->di;
-			sd.args[1] = regs->si;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->r10;
-			sd.args[4] = regs->r8;
-			sd.args[5] = regs->r9;
-		} else
-#endif
-		{
-			sd.args[0] = regs->bx;
-			sd.args[1] = regs->cx;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->si;
-			sd.args[4] = regs->di;
-			sd.args[5] = regs->bp;
-		}
-
-		ret = __secure_computing(&sd);
-		if (ret == -1)
-			return ret;
-	}
-#endif
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->orig_ax);
-
-	do_audit_syscall_entry(regs, arch);
-
-	return ret ?: regs->orig_ax;
-}
-
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
@@ -366,26 +227,10 @@ static void __syscall_return_slowpath(st
 	exit_to_user_mode();
 }
 
-static noinstr long syscall_enter(struct pt_regs *regs, unsigned long nr)
-{
-	struct thread_info *ti;
-
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-
-	local_irq_enable();
-	ti = current_thread_info();
-	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
-		nr = syscall_trace_enter(regs);
-
-	instrumentation_end();
-	return nr;
-}
-
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-	nr = syscall_enter(regs, nr);
+	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
@@ -407,6 +252,8 @@ static noinstr long syscall_enter(struct
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 {
+	unsigned int nr = (unsigned int)regs->orig_ax;
+
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
 	/*
@@ -414,7 +261,7 @@ static __always_inline unsigned int sysc
 	 * orig_ax, the unsigned int return value truncates it.  This may
 	 * or may not be necessary, but it matches the old asm behavior.
 	 */
-	return syscall_enter(regs, (unsigned int)regs->orig_ax);
+	return (unsigned int)syscall_enter_from_user_mode(regs, nr);
 }
 
 /*
@@ -568,7 +415,7 @@ SYSCALL_DEFINE0(ni_syscall)
  * solves the problem of kernel mode pagefaults which can schedule, which
  * is not possible after invoking rcu_irq_enter() without undoing it.
  *
- * For user mode entries enter_from_user_mode() must be invoked to
+ * For user mode entries irqentry_enter_from_user_mode() must be invoked to
  * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
  * would not be possible.
  *
@@ -584,7 +431,7 @@ idtentry_state_t noinstr idtentry_enter(
 	};
 
 	if (user_mode(regs)) {
-		enter_from_user_mode(regs);
+		irqentry_enter_from_user_mode(regs);
 		return ret;
 	}
 
@@ -615,7 +462,7 @@ idtentry_state_t noinstr idtentry_enter(
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
-		 * as in enter_from_user_mode().
+		 * as in irqentry_enter_from_user_mode().
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
@@ -709,18 +556,6 @@ void noinstr idtentry_exit(struct pt_reg
 }
 
 /**
- * idtentry_enter_user - Handle state tracking on idtentry from user mode
- * @regs:	Pointer to pt_regs of interrupted context
- *
- * Invokes enter_from_user_mode() to establish the proper context for
- * NOHZ_FULL. Otherwise scheduling on exit would not be possible.
- */
-void noinstr idtentry_enter_user(struct pt_regs *regs)
-{
-	enter_from_user_mode(regs);
-}
-
-/**
  * idtentry_exit_user - Handle return from exception to user mode
  * @regs:	Pointer to pt_regs (exception entry regs)
  *
--- /dev/null
+++ b/arch/x86/include/asm/entry-common.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_X86_ENTRY_COMMON_H
+#define _ASM_X86_ENTRY_COMMON_H
+
+/* Check that the stack and regs on entry from user mode are sane. */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+		/*
+		 * Make sure that the entry code gave us a sensible EFLAGS
+		 * register.  Native because we want to check the actual CPU
+		 * state, not the interrupt state as imagined by Xen.
+		 */
+		unsigned long flags = native_save_fl();
+		WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
+				      X86_EFLAGS_NT));
+
+		/* We think we came from user mode. Make sure pt_regs agrees. */
+		WARN_ON_ONCE(!user_mode(regs));
+
+		/*
+		 * All entries from user mode (except #DF) should be on the
+		 * normal thread stack and should have user pt_regs in the
+		 * correct location.
+		 */
+		WARN_ON_ONCE(!on_thread_stack());
+		WARN_ON_ONCE(regs != task_pt_regs(current));
+	}
+}
+#define arch_check_user_regs arch_check_user_regs
+
+#endif
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -6,11 +6,14 @@
 #include <asm/trapnr.h>
 
 #ifndef __ASSEMBLY__
+#include <linux/entry-common.h>
 #include <linux/hardirq.h>
 
 #include <asm/irq_stack.h>
 
-void idtentry_enter_user(struct pt_regs *regs);
+/* Temporary define */
+#define idtentry_enter_user	irqentry_enter_from_user_mode
+
 void idtentry_exit_user(struct pt_regs *regs);
 
 typedef struct idtentry_state {
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -133,11 +133,6 @@ struct thread_info {
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 
-/* Work to do before invoking the actual syscall. */
-#define _TIF_WORK_SYSCALL_ENTRY	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
-
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
 	(_TIF_NOCPUID | _TIF_NOTSC | _TIF_BLOCKSTEP |		\


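For orientation, a hedged sketch of what the generic entry function does,
reconstructed from the x86 code removed above; the work-flag mask name is
an assumption about the generic implementation:

/* Sketch only, not the actual kernel/entry/common.c code. */
noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
{
	unsigned long ti_work;

	enter_from_user_mode(regs);	/* lockdep, context tracking, tracing */
	instrumentation_begin();

	local_irq_enable();
	ti_work = READ_ONCE(current_thread_info()->flags);
	if (ti_work & SYSCALL_ENTER_WORK)	/* assumed mask name */
		syscall = syscall_trace_enter(regs, syscall);	/* ptrace, seccomp, audit */

	instrumentation_end();
	return syscall;
}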

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch V5 11/15] x86/entry: Use generic syscall exit functionality
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 10/15] x86/entry: Use generic syscall entry function Thomas Gleixner
@ 2020-07-22 22:00 ` Thomas Gleixner
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-07-22 22:00 ` [patch V5 12/15] x86/entry: Cleanup idtentry_entry/exit_user Thomas Gleixner
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Replace the x86 variant with the generic version. Provide the relevant
architecture specific helper functions and defines.

Use a temporary define for idtentry_exit_user which will be cleaned up
separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>

---
V4: Drop a pointless define
    Adjust to moved TIF_USER_RETURN_NOTIFY handling
    Bring back I/O bitmap handling
---
 arch/x86/entry/common.c             |  221 ------------------------------------
 arch/x86/entry/entry_32.S           |    2 
 arch/x86/entry/entry_64.S           |    2 
 arch/x86/include/asm/entry-common.h |   44 +++++++
 arch/x86/include/asm/idtentry.h     |    3 
 arch/x86/include/asm/signal.h       |    1 
 arch/x86/kernel/signal.c            |    3 
 7 files changed, 54 insertions(+), 222 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -15,15 +15,8 @@
 #include <linux/smp.h>
 #include <linux/errno.h>
 #include <linux/ptrace.h>
-#include <linux/tracehook.h>
-#include <linux/audit.h>
-#include <linux/signal.h>
 #include <linux/export.h>
-#include <linux/context_tracking.h>
-#include <linux/user-return-notifier.h>
 #include <linux/nospec.h>
-#include <linux/uprobes.h>
-#include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
@@ -42,191 +35,6 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
-#include <trace/events/syscalls.h>
-
-/**
- * exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall exit enables interrupts, but the kernel state is interrupts
- * disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Clear CPU buffers if CPU is affected by MDS and the migitation is on.
- * 4) Tell lockdep that interrupts are enabled
- */
-static __always_inline void exit_to_user_mode(void)
-{
-	instrumentation_begin();
-	trace_hardirqs_on_prepare();
-	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	instrumentation_end();
-
-	user_enter_irqoff();
-	mds_user_clear_cpu_buffers();
-	lockdep_hardirqs_on(CALLER_ADDR0);
-}
-
-#define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
-
-static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
-{
-	/*
-	 * In order to return to user mode, we need to have IRQs off with
-	 * none of EXIT_TO_USERMODE_LOOP_FLAGS set.  Several of these flags
-	 * can be set at any time on preemptible kernels if we have IRQs on,
-	 * so we need to loop.  Disabling preemption wouldn't help: doing the
-	 * work to clear some of the flags can sleep.
-	 */
-	while (true) {
-		/* We have work to do. */
-		local_irq_enable();
-
-		if (cached_flags & _TIF_NEED_RESCHED)
-			schedule();
-
-		if (cached_flags & _TIF_UPROBE)
-			uprobe_notify_resume(regs);
-
-		if (cached_flags & _TIF_PATCH_PENDING)
-			klp_update_patch_state(current);
-
-		/* deal with pending signal delivery */
-		if (cached_flags & _TIF_SIGPENDING)
-			do_signal(regs);
-
-		if (cached_flags & _TIF_NOTIFY_RESUME) {
-			clear_thread_flag(TIF_NOTIFY_RESUME);
-			tracehook_notify_resume(regs);
-			rseq_handle_notify_resume(NULL, regs);
-		}
-
-		/* Disable IRQs and retry */
-		local_irq_disable();
-
-		cached_flags = READ_ONCE(current_thread_info()->flags);
-
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
-			break;
-	}
-}
-
-static void __prepare_exit_to_usermode(struct pt_regs *regs)
-{
-	struct thread_info *ti = current_thread_info();
-	u32 cached_flags;
-
-	addr_limit_user_check();
-
-	lockdep_assert_irqs_disabled();
-	lockdep_sys_exit();
-
-	cached_flags = READ_ONCE(ti->flags);
-
-	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
-		exit_to_usermode_loop(regs, cached_flags);
-
-	/* Reload ti->flags; we may have rescheduled above. */
-	cached_flags = READ_ONCE(ti->flags);
-
-	if (cached_flags & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
-
-	if (unlikely(cached_flags & _TIF_IO_BITMAP))
-		tss_update_io_bitmap();
-
-	fpregs_assert_state_consistent();
-	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
-		switch_fpu_return();
-
-#ifdef CONFIG_COMPAT
-	/*
-	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
-	 * returning to user mode.  We need to clear it *after* signal
-	 * handling, because syscall restart has a fixup for compat
-	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
-	 * selftest.
-	 *
-	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
-	 * special case only applies after poking regs and before the
-	 * very next return to user mode.
-	 */
-	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
-#endif
-}
-
-static noinstr void prepare_exit_to_usermode(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__prepare_exit_to_usermode(regs);
-	instrumentation_end();
-	exit_to_user_mode();
-}
-
-#define SYSCALL_EXIT_WORK_FLAGS				\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
-
-static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
-{
-	bool step;
-
-	audit_syscall_exit(regs);
-
-	if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
-		trace_sys_exit(regs, regs->ax);
-
-	/*
-	 * If TIF_SYSCALL_EMU is set, we only get here because of
-	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
-	 * We already reported this syscall instruction in
-	 * syscall_trace_enter().
-	 */
-	step = unlikely(
-		(cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
-		== _TIF_SINGLESTEP);
-	if (step || cached_flags & _TIF_SYSCALL_TRACE)
-		tracehook_report_syscall_exit(regs, step);
-}
-
-static void __syscall_return_slowpath(struct pt_regs *regs)
-{
-	struct thread_info *ti = current_thread_info();
-	u32 cached_flags = READ_ONCE(ti->flags);
-
-	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
-
-	if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
-	    WARN(irqs_disabled(), "syscall %ld left IRQs disabled", regs->orig_ax))
-		local_irq_enable();
-
-	rseq_syscall(regs);
-
-	/*
-	 * First do one-time work.  If these work items are enabled, we
-	 * want to run them exactly once per syscall exit with IRQs on.
-	 */
-	if (unlikely(cached_flags & SYSCALL_EXIT_WORK_FLAGS))
-		syscall_slow_exit_work(regs, cached_flags);
-
-	local_irq_disable();
-	__prepare_exit_to_usermode(regs);
-}
-
-/*
- * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
- * state such that we can immediately switch to user mode.
- */
-__visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__syscall_return_slowpath(regs);
-	instrumentation_end();
-	exit_to_user_mode();
-}
-
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -245,7 +53,7 @@ static void __syscall_return_slowpath(st
 #endif
 	}
 	instrumentation_end();
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 }
 #endif
 
@@ -284,7 +92,7 @@ static __always_inline void do_syscall_3
 	unsigned int nr = syscall_32_enter(regs);
 
 	do_syscall_32_irqs_on(regs, nr);
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
@@ -310,13 +118,13 @@ static noinstr bool __do_fast_syscall_32
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-		syscall_return_slowpath(regs);
+		syscall_exit_to_user_mode(regs);
 		return false;
 	}
 
 	/* Now this is just like a normal syscall. */
 	do_syscall_32_irqs_on(regs, nr);
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 	return true;
 }
 
@@ -524,7 +332,7 @@ void noinstr idtentry_exit(struct pt_reg
 
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
-		prepare_exit_to_usermode(regs);
+		irqentry_exit_to_user_mode(regs);
 	} else if (regs->flags & X86_EFLAGS_IF) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
@@ -555,25 +363,6 @@ void noinstr idtentry_exit(struct pt_reg
 	}
 }
 
-/**
- * idtentry_exit_user - Handle return from exception to user mode
- * @regs:	Pointer to pt_regs (exception entry regs)
- *
- * Runs the necessary preemption and work checks and returns to the caller
- * with interrupts disabled and no further work pending.
- *
- * This is the last action before returning to the low level ASM code which
- * just needs to return to the appropriate context.
- *
- * Counterpart to idtentry_enter_user().
- */
-void noinstr idtentry_exit_user(struct pt_regs *regs)
-{
-	lockdep_assert_irqs_disabled();
-
-	prepare_exit_to_usermode(regs);
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -846,7 +846,7 @@ SYM_CODE_START(ret_from_fork)
 2:
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
-	call    syscall_return_slowpath
+	call    syscall_exit_to_user_mode
 	jmp     .Lsyscall_32_done
 
 	/* kernel thread */
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -283,7 +283,7 @@ SYM_CODE_START(ret_from_fork)
 2:
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 
 1:
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -2,6 +2,12 @@
 #ifndef _ASM_X86_ENTRY_COMMON_H
 #define _ASM_X86_ENTRY_COMMON_H
 
+#include <linux/user-return-notifier.h>
+
+#include <asm/nospec-branch.h>
+#include <asm/io_bitmap.h>
+#include <asm/fpu/api.h>
+
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
 {
@@ -29,4 +35,42 @@ static __always_inline void arch_check_u
 }
 #define arch_check_user_regs arch_check_user_regs
 
+#define ARCH_SYSCALL_EXIT_WORK		(_TIF_SINGLESTEP)
+
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work)
+{
+	if (ti_work & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
+	if (unlikely(ti_work & _TIF_IO_BITMAP))
+		tss_update_io_bitmap();
+
+	fpregs_assert_state_consistent();
+	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+		switch_fpu_return();
+
+#ifdef CONFIG_COMPAT
+	/*
+	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
+	 * returning to user mode.  We need to clear it *after* signal
+	 * handling, because syscall restart has a fixup for compat
+	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
+	 * selftest.
+	 *
+	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
+	 * special case only applies after poking regs and before the
+	 * very next return to user mode.
+	 */
+	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
+#endif
+}
+#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
+
+static __always_inline void arch_exit_to_user_mode(void)
+{
+	mds_user_clear_cpu_buffers();
+}
+#define arch_exit_to_user_mode arch_exit_to_user_mode
+
 #endif
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -13,8 +13,7 @@
 
 /* Temporary define */
 #define idtentry_enter_user	irqentry_enter_from_user_mode
-
-void idtentry_exit_user(struct pt_regs *regs);
+#define idtentry_exit_user	irqentry_exit_to_user_mode
 
 typedef struct idtentry_state {
 	bool exit_rcu;
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -35,7 +35,6 @@ typedef sigset_t compat_sigset_t;
 #endif /* __ASSEMBLY__ */
 #include <uapi/asm/signal.h>
 #ifndef __ASSEMBLY__
-extern void do_signal(struct pt_regs *regs);
 
 #define __ARCH_HAS_SA_RESTORER
 
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
 #include <linux/context_tracking.h>
+#include <linux/entry-common.h>
 #include <linux/syscalls.h>
 
 #include <asm/processor.h>
@@ -803,7 +804,7 @@ static inline unsigned long get_nr_resta
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
  * mistake.
  */
-void do_signal(struct pt_regs *regs)
+void arch_do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 


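The removed loop maps onto a generic work loop. A hedged sketch of its
shape, with arch_do_signal() plugged in as this patch provides it; the body
is paraphrased, not quoted from the generic code:

/* Sketch only, not the actual generic implementation. */
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
					    unsigned long ti_work)
{
	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		local_irq_enable();

		if (ti_work & _TIF_NEED_RESCHED)
			schedule();

		if (ti_work & _TIF_SIGPENDING)
			arch_do_signal(regs);	/* x86: was do_signal() */

		if (ti_work & _TIF_NOTIFY_RESUME) {
			clear_thread_flag(TIF_NOTIFY_RESUME);
			tracehook_notify_resume(regs);
		}

		/* uprobes, live patching etc. are handled likewise */

		local_irq_disable();
		ti_work = READ_ONCE(current_thread_info()->flags);
	}

	/* arch_exit_to_user_mode_prepare() then runs once, IRQs off */
	return ti_work;
}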

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch V5 12/15] x86/entry: Cleanup idtentry_entry/exit_user
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 11/15] x86/entry: Use generic syscall exit functionality Thomas Gleixner
@ 2020-07-22 22:00 ` Thomas Gleixner
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-07-22 22:00 ` [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code Thomas Gleixner
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Clean up the temporary defines and use irqentry_ instead of idtentry_.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
 arch/x86/include/asm/idtentry.h |    4 ----
 arch/x86/kernel/cpu/mce/core.c  |    4 ++--
 arch/x86/kernel/traps.c         |   18 +++++++++---------
 3 files changed, 11 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,10 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-/* Temporary define */
-#define idtentry_enter_user	irqentry_enter_from_user_mode
-#define idtentry_exit_user	irqentry_exit_to_user_mode
-
 typedef struct idtentry_state {
 	bool exit_rcu;
 } idtentry_state_t;
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1927,11 +1927,11 @@ static __always_inline void exc_machine_
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
 {
-	idtentry_enter_user(regs);
+	irqentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 	machine_check_vector(regs);
 	instrumentation_end();
-	idtentry_exit_user(regs);
+	irqentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -638,18 +638,18 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		return;
 
 	/*
-	 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
-	 * can trigger INT3, hence poke_int3_handler() must be done
-	 * before. If the entry came from kernel mode, then use nmi_enter()
-	 * because the INT3 could have been hit in any context including
-	 * NMI.
+	 * irqentry_enter_from_user_mode() uses static_branch_{,un}likely()
+	 * and therefore can trigger INT3, hence poke_int3_handler() must
+	 * be done before. If the entry came from kernel mode, then use
+	 * nmi_enter() because the INT3 could have been hit in any context
+	 * including NMI.
 	 */
 	if (user_mode(regs)) {
-		idtentry_enter_user(regs);
+		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
 		do_int3_user(regs);
 		instrumentation_end();
-		idtentry_exit_user(regs);
+		irqentry_exit_to_user_mode(regs);
 	} else {
 		nmi_enter();
 		instrumentation_begin();
@@ -901,12 +901,12 @@ static __always_inline void exc_debug_us
 	 */
 	WARN_ON_ONCE(!user_mode(regs));
 
-	idtentry_enter_user(regs);
+	irqentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	handle_debug(regs, dr6, true);
 	instrumentation_end();
-	idtentry_exit_user(regs);
+	irqentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64




^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 12/15] x86/entry: Cleanup idtentry_entry/exit_user Thomas Gleixner
@ 2020-07-22 22:00 ` Thomas Gleixner
  2020-07-24 14:28   ` Ingo Molnar
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-07-22 22:00 ` [patch V5 14/15] x86/entry: Cleanup idtentry_enter/exit Thomas Gleixner
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Replace the x86 code with the generic variant. Use temporary defines for
idtentry_* which will be cleaned up in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/entry/common.c         |  167 ----------------------------------------
 arch/x86/include/asm/idtentry.h |   10 --
 2 files changed, 5 insertions(+), 172 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -198,171 +198,6 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-/**
- * idtentry_enter - Handle state tracking on ordinary idtentries
- * @regs:	Pointer to pt_regs of interrupted context
- *
- * Invokes:
- *  - lockdep irqflag state tracking as low level ASM entry disabled
- *    interrupts.
- *
- *  - Context tracking if the exception hit user mode.
- *
- *  - The hardirq tracer to keep the state consistent as low level ASM
- *    entry disabled interrupts.
- *
- * As a precondition, this requires that the entry came from user mode,
- * idle, or a kernel context in which RCU is watching.
- *
- * For kernel mode entries RCU handling is done conditional. If RCU is
- * watching then the only RCU requirement is to check whether the tick has
- * to be restarted. If RCU is not watching then rcu_irq_enter() has to be
- * invoked on entry and rcu_irq_exit() on exit.
- *
- * Avoiding the rcu_irq_enter/exit() calls is an optimization but also
- * solves the problem of kernel mode pagefaults which can schedule, which
- * is not possible after invoking rcu_irq_enter() without undoing it.
- *
- * For user mode entries irqentry_enter_from_user_mode() must be invoked to
- * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
- * would not be possible.
- *
- * Returns: An opaque object that must be passed to idtentry_exit()
- *
- * The return value must be fed into the state argument of
- * idtentry_exit().
- */
-idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
-{
-	idtentry_state_t ret = {
-		.exit_rcu = false,
-	};
-
-	if (user_mode(regs)) {
-		irqentry_enter_from_user_mode(regs);
-		return ret;
-	}
-
-	/*
-	 * If this entry hit the idle task invoke rcu_irq_enter() whether
-	 * RCU is watching or not.
-	 *
-	 * Interupts can nest when the first interrupt invokes softirq
-	 * processing on return which enables interrupts.
-	 *
-	 * Scheduler ticks in the idle task can mark quiescent state and
-	 * terminate a grace period, if and only if the timer interrupt is
-	 * not nested into another interrupt.
-	 *
-	 * Checking for __rcu_is_watching() here would prevent the nesting
-	 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
-	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
-	 * assume that it is the first interupt and eventually claim
-	 * quiescient state and end grace periods prematurely.
-	 *
-	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
-	 * consistent.
-	 *
-	 * TINY_RCU does not support EQS, so let the compiler eliminate
-	 * this part when enabled.
-	 */
-	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
-		/*
-		 * If RCU is not watching then the same careful
-		 * sequence vs. lockdep and tracing is required
-		 * as in irqentry_enter_from_user_mode().
-		 */
-		lockdep_hardirqs_off(CALLER_ADDR0);
-		rcu_irq_enter();
-		instrumentation_begin();
-		trace_hardirqs_off_finish();
-		instrumentation_end();
-
-		ret.exit_rcu = true;
-		return ret;
-	}
-
-	/*
-	 * If RCU is watching then RCU only wants to check whether it needs
-	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
-	 * already contains a warning when RCU is not watching, so no point
-	 * in having another one here.
-	 */
-	instrumentation_begin();
-	rcu_irq_enter_check_tick();
-	/* Use the combo lockdep/tracing function */
-	trace_hardirqs_off();
-	instrumentation_end();
-
-	return ret;
-}
-
-static void idtentry_exit_cond_resched(struct pt_regs *regs, bool may_sched)
-{
-	if (may_sched && !preempt_count()) {
-		/* Sanity check RCU and thread stack */
-		rcu_irq_exit_check_preempt();
-		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
-			WARN_ON_ONCE(!on_thread_stack());
-		if (need_resched())
-			preempt_schedule_irq();
-	}
-	/* Covers both tracing and lockdep */
-	trace_hardirqs_on();
-}
-
-/**
- * idtentry_exit - Handle return from exception that used idtentry_enter()
- * @regs:	Pointer to pt_regs (exception entry regs)
- * @state:	Return value from matching call to idtentry_enter()
- *
- * Depending on the return target (kernel/user) this runs the necessary
- * preemption and work checks if possible and reguired and returns to
- * the caller with interrupts disabled and no further work pending.
- *
- * This is the last action before returning to the low level ASM code which
- * just needs to return to the appropriate context.
- *
- * Counterpart to idtentry_enter(). The return value of the entry
- * function must be fed into the @state argument.
- */
-void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
-{
-	lockdep_assert_irqs_disabled();
-
-	/* Check whether this returns to user mode */
-	if (user_mode(regs)) {
-		irqentry_exit_to_user_mode(regs);
-	} else if (regs->flags & X86_EFLAGS_IF) {
-		/*
-		 * If RCU was not watching on entry this needs to be done
-		 * carefully and needs the same ordering of lockdep/tracing
-		 * and RCU as the return to user mode path.
-		 */
-		if (state.exit_rcu) {
-			instrumentation_begin();
-			/* Tell the tracer that IRET will enable interrupts */
-			trace_hardirqs_on_prepare();
-			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-			instrumentation_end();
-			rcu_irq_exit();
-			lockdep_hardirqs_on(CALLER_ADDR0);
-			return;
-		}
-
-		instrumentation_begin();
-		idtentry_exit_cond_resched(regs, IS_ENABLED(CONFIG_PREEMPTION));
-		instrumentation_end();
-	} else {
-		/*
-		 * IRQ flags state is correct already. Just tell RCU if it
-		 * was not watching on entry.
-		 */
-		if (state.exit_rcu)
-			rcu_irq_exit();
-	}
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -427,7 +262,7 @@ static void __xen_pv_evtchn_do_upcall(vo
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
 		instrumentation_begin();
-		idtentry_exit_cond_resched(regs, true);
+		irqentry_exit_cond_resched();
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,12 +11,10 @@
 
 #include <asm/irq_stack.h>
 
-typedef struct idtentry_state {
-	bool exit_rcu;
-} idtentry_state_t;
-
-idtentry_state_t idtentry_enter(struct pt_regs *regs);
-void idtentry_exit(struct pt_regs *regs, idtentry_state_t state);
+/* Temporary defines */
+typedef irqentry_state_t idtentry_state_t;
+#define idtentry_enter irqentry_enter
+#define idtentry_exit irqentry_exit
 
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points


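The shape every converted entry point now has, shown as a hedged sketch;
exc_example is a placeholder name, and the DEFINE_IDTENTRY macros (see the
next patch) expand to exactly this pattern:

__visible noinstr void exc_example(struct pt_regs *regs)
{
	irqentry_state_t state = irqentry_enter(regs);

	instrumentation_begin();
	__exc_example(regs);	/* the instrumentable handler body */
	instrumentation_end();

	irqentry_exit(regs, state);
}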

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch V5 14/15] x86/entry: Cleanup idtentry_enter/exit
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code Thomas Gleixner
@ 2020-07-22 22:00 ` Thomas Gleixner
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
  2020-07-24 20:51 ` [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
  15 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Remove the temporary defines and fix up all references.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>

---
 arch/x86/entry/common.c         |    6 +++---
 arch/x86/include/asm/idtentry.h |   33 ++++++++++++++-------------------
 arch/x86/kernel/kvm.c           |    6 +++---
 arch/x86/kernel/traps.c         |    6 +++---
 arch/x86/mm/fault.c             |    6 +++---
 5 files changed, 26 insertions(+), 31 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -248,9 +248,9 @@ static void __xen_pv_evtchn_do_upcall(vo
 {
 	struct pt_regs *old_regs;
 	bool inhcall;
-	idtentry_state_t state;
+	irqentry_state_t state;
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	old_regs = set_irq_regs(regs);
 
 	instrumentation_begin();
@@ -266,7 +266,7 @@ static void __xen_pv_evtchn_do_upcall(vo
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
-		idtentry_exit(regs, state);
+		irqentry_exit(regs, state);
 	}
 }
 #endif /* CONFIG_XEN_PV */
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,11 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-/* Temporary defines */
-typedef irqentry_state_t idtentry_state_t;
-#define idtentry_enter irqentry_enter
-#define idtentry_exit irqentry_exit
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -45,8 +40,8 @@ typedef irqentry_state_t idtentry_state_
  * The macro is written so it acts as function definition. Append the
  * body with a pair of curly brackets.
  *
- * idtentry_enter() contains common code which has to be invoked before
- * arbitrary code in the body. idtentry_exit() contains common code
+ * irqentry_enter() contains common code which has to be invoked before
+ * arbitrary code in the body. irqentry_exit() contains common code
  * which has to run before returning to the low level assembly code.
  */
 #define DEFINE_IDTENTRY(func)						\
@@ -54,12 +49,12 @@ static __always_inline void __##func(str
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -101,12 +96,12 @@ static __always_inline void __##func(str
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs, error_code);					\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs,		\
@@ -161,7 +156,7 @@ static __always_inline void __##func(str
  * body with a pair of curly brackets.
  *
  * Contrary to DEFINE_IDTENTRY_ERRORCODE() this does not invoke the
- * idtentry_enter/exit() helpers before and after the body invocation. This
+ * irqentry_enter/exit() helpers before and after the body invocation. This
  * needs to be done in the body itself if applicable. Use if extra work
  * is required before the enter/exit() helpers are invoked.
  */
@@ -197,7 +192,7 @@ static __always_inline void __##func(str
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -205,7 +200,7 @@ static __always_inline void __##func(str
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -229,7 +224,7 @@ static __always_inline void __##func(str
  * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
  * @func:	Function name of the entry point
  *
- * idtentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
+ * irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
  * function body. KVM L1D flush request is set.
  *
  * Runs the function on the interrupt stack if the entry hit kernel mode
@@ -239,7 +234,7 @@ static void __##func(struct pt_regs *reg
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -247,7 +242,7 @@ static void __##func(struct pt_regs *reg
 	run_on_irqstack_cond(__##func, regs, regs);			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static noinline void __##func(struct pt_regs *regs)
@@ -268,7 +263,7 @@ static __always_inline void __##func(str
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
@@ -276,7 +271,7 @@ static __always_inline void __##func(str
 	__##func (regs);						\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -233,7 +233,7 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf
 noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
 	u32 reason = kvm_read_and_reset_apf_flags();
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	switch (reason) {
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
@@ -243,7 +243,7 @@ noinstr bool __kvm_handle_async_pf(struc
 		return false;
 	}
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struc
 	}
 
 	instrumentation_end();
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 	return true;
 }
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt
 
 DEFINE_IDTENTRY_RAW(exc_invalid_op)
 {
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	/*
 	 * We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	if (!user_mode(regs) && handle_bug(regs))
 		return;
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	instrumentation_begin();
 	handle_invalid_op(regs);
 	instrumentation_end();
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 }
 
 DEFINE_IDTENTRY(exc_coproc_segment_overrun)
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1377,7 +1377,7 @@ handle_page_fault(struct pt_regs *regs,
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
 	unsigned long address = read_cr2();
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	prefetchw(&current->mm->mmap_lock);
 
@@ -1412,11 +1412,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_f
 	 * code reenabled RCU to avoid subsequent wreckage which helps
 	 * debugability.
 	 */
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 
 	instrumentation_begin();
 	handle_page_fault(regs, error_code, address);
 	instrumentation_end();
 
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 }




^ permalink raw reply	[flat|nested] 45+ messages in thread

* [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 14/15] x86/entry: Cleanup idtentry_enter/exit Thomas Gleixner
@ 2020-07-22 22:00 ` Thomas Gleixner
  2020-07-24  0:17   ` Sean Christopherson
                     ` (2 more replies)
  2020-07-24 20:51 ` [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
  15 siblings, 3 replies; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-22 22:00 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

From: Thomas Gleixner <tglx@linutronix.de>

Use the generic infrastructure to check for and handle pending work before
transitioning into guest mode.

This now handles TIF_NOTIFY_RESUME as well, which was ignored so
far. Handling it is important as this covers task work, and task work will
be used to offload the heavy lifting of POSIX CPU timers to thread context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Rename exit -> xfer
---
 arch/x86/kvm/Kconfig   |    1 +
 arch/x86/kvm/vmx/vmx.c |   11 +++++------
 arch/x86/kvm/x86.c     |   15 ++++++---------
 3 files changed, 12 insertions(+), 15 deletions(-)

--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -42,6 +42,7 @@ config KVM
 	select HAVE_KVM_MSI
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_NO_POLL
+	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/tboot.h>
 #include <linux/trace_events.h>
+#include <linux/entry-kvm.h>
 
 #include <asm/apic.h>
 #include <asm/asm.h>
@@ -5376,14 +5377,12 @@ static int handle_invalid_guest_state(st
 		}
 
 		/*
-		 * Note, return 1 and not 0, vcpu_run() is responsible for
-		 * morphing the pending signal into the proper return code.
+		 * Note, return 1 and not 0, vcpu_run() will invoke
+		 * xfer_to_guest_mode() which will create a proper return
+		 * code.
 		 */
-		if (signal_pending(current))
+		if (__xfer_to_guest_mode_work_pending())
 			return 1;
-
-		if (need_resched())
-			schedule();
 	}
 
 	return 1;
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -56,6 +56,7 @@
 #include <linux/sched/stat.h>
 #include <linux/sched/isolation.h>
 #include <linux/mem_encrypt.h>
+#include <linux/entry-kvm.h>
 
 #include <trace/events/kvm.h>
 
@@ -1585,7 +1586,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
 bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
 {
 	return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
-		need_resched() || signal_pending(current);
+		xfer_to_guest_mode_work_pending();
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_exit_request);
 
@@ -8676,15 +8677,11 @@ static int vcpu_run(struct kvm_vcpu *vcp
 			break;
 		}
 
-		if (signal_pending(current)) {
-			r = -EINTR;
-			vcpu->run->exit_reason = KVM_EXIT_INTR;
-			++vcpu->stat.signal_exits;
-			break;
-		}
-		if (need_resched()) {
+		if (xfer_to_guest_mode_work_pending()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-			cond_resched();
+			r = xfer_to_guest_mode(vcpu);
+			if (r)
+				return r;
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 		}
 	}

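For reference, a hedged sketch of the work handling behind
xfer_to_guest_mode(), reconstructed from the open-coded checks this patch
removes; the flag mask and the exit-reason bookkeeping are assumptions
about the generic entry-kvm code:

/* Sketch only, not the actual kernel/entry/kvm.c code. */
static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
				   unsigned long ti_work)
{
	do {
		if (ti_work & _TIF_SIGPENDING) {
			/* morph the pending signal into the return code */
			vcpu->run->exit_reason = KVM_EXIT_INTR;
			vcpu->stat.signal_exits++;
			return -EINTR;
		}

		if (ti_work & _TIF_NEED_RESCHED)
			schedule();

		if (ti_work & _TIF_NOTIFY_RESUME) {
			clear_thread_flag(TIF_NOTIFY_RESUME);
			tracehook_notify_resume(NULL);
		}

		ti_work = READ_ONCE(current_thread_info()->flags);
	} while (ti_work & XFER_TO_GUEST_MODE_WORK);

	return 0;
}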

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 08/15] x86/entry: Move user return notifier out of loop
  2020-07-22 22:00 ` [patch V5 08/15] x86/entry: Move user return notifier out of loop Thomas Gleixner
@ 2020-07-23 23:41   ` Sean Christopherson
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2020-07-23 23:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi

On Thu, Jul 23, 2020 at 12:00:02AM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Guests and user space share certain MSRs. KVM sets these MSRs to guest
> values once and does not set them back to user space values on every VM
> exit to spare the costly MSR operations.
> 
> User return notifiers ensure that these MSRs are set back to the correct
> values before returning to user space in exit_to_usermode_loop().
> 
> There is no reason to evaluate the TIF flag indicating that user return
> notifiers need to be invoked in the loop. The important point is that they
> are invoked before returning to user space.
> 
> Move the invocation out of the loop into the section which does the last
> preparatory steps before returning to user space. That section is not
> preemptible and runs with interrupts disabled until the actual return.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-and-tested-by: Sean Christopherson <sean.j.christopherson@intel.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
@ 2020-07-24  0:17   ` Sean Christopherson
  2020-07-24  0:46     ` Thomas Gleixner
  2020-07-24 14:24   ` Ingo Molnar
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2020-07-24  0:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi

On Thu, Jul 23, 2020 at 12:00:09AM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Use the generic infrastructure to check for and handle pending work before
> transitioning into guest mode.
> 
> This now handles TIF_NOTIFY_RESUME as well, which was ignored so
> far. Handling it is important as this covers task work, and task work will
> be used to offload the heavy lifting of POSIX CPU timers to thread context.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V5: Rename exit -> xfer
> ---

One nit/question below (though it's really about patch 5).

Reviewed-and-tested-by: Sean Christopherson <sean.j.christopherson@intel.com>

> @@ -8676,15 +8677,11 @@ static int vcpu_run(struct kvm_vcpu *vcp
>  			break;
>  		}
>  
> -		if (signal_pending(current)) {
> -			r = -EINTR;
> -			vcpu->run->exit_reason = KVM_EXIT_INTR;
> -			++vcpu->stat.signal_exits;
> -			break;
> -		}
> -		if (need_resched()) {
> +		if (xfer_to_guest_mode_work_pending()) {
>  			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> -			cond_resched();
> +			r = xfer_to_guest_mode(vcpu);

Any reason not to call this xfer_to_guest_mode_work()?  Or handle_work(),
do_work(), etc...  Without the "work" part, it looks like a function that
should be invoked unconditionally.  It's obvious that's not the case if
one looks at the implementation, but it's a bit confusing on the KVM side
of things.

> +			if (r)
> +				return r;
>  			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>  		}
>  	}
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-24  0:17   ` Sean Christopherson
@ 2020-07-24  0:46     ` Thomas Gleixner
  2020-07-24  0:55       ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-24  0:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi

Sean,

Sean Christopherson <sean.j.christopherson@intel.com> writes:
> On Thu, Jul 23, 2020 at 12:00:09AM +0200, Thomas Gleixner wrote:
>> +		if (xfer_to_guest_mode_work_pending()) {
>>  			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> -			cond_resched();
>> +			r = xfer_to_guest_mode(vcpu);
>
> Any reason not to call this xfer_to_guest_mode_work()?  Or handle_work(),
> do_work(), etc...  Without the "work" part, it looks like a function that
> should be invoked unconditionally.  It's obvious that's not the case if
> one looks at the implementation, but it's a bit confusing on the KVM side
> of things.

The reason is probably laziness. The original approach was to have this
as close as possible to user entry/exit, but with the recent changes
vs. instrumentation and RCU this is no longer the case.

I really want to keep the notion of transitioning in the function name,
so xfer_to_guest_mode_handle_work() makes a lot of sense.

I'll change that before merging the lot into the tip tree if your
Reviewed-by still stands with that change made w/o reposting.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-24  0:46     ` Thomas Gleixner
@ 2020-07-24  0:55       ` Sean Christopherson
  0 siblings, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2020-07-24  0:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi

On Fri, Jul 24, 2020 at 02:46:26AM +0200, Thomas Gleixner wrote:
> Sean,
> 
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > On Thu, Jul 23, 2020 at 12:00:09AM +0200, Thomas Gleixner wrote:
> >> +		if (xfer_to_guest_mode_work_pending()) {
> >>  			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> >> -			cond_resched();
> >> +			r = xfer_to_guest_mode(vcpu);
> >
> > Any reason not to call this xfer_to_guest_mode_work()?  Or handle_work(),
> > do_work(), etc...  Without the "work" part, it looks like a function that
> > should be invoked unconditionally.  It's obvious that's not the case if
> > one looks at the implementation, but it's a bit confusing on the KVM side
> > of things.
> 
> The reason is probably laziness. The original approach was to have this
> as close as possible to user entry/exit, but with the recent changes
> vs. instrumentation and RCU this is no longer the case.
> 
> I really want to keep the notion of transitioning in the function name,
> so xfer_to_guest_mode_handle_work() makes a lot of sense.
> 
> I'll change that before merging the lot into the tip tree if your
> Reviewed-by still stands with that change made w/o reposting.

Ya, works for me.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
  2020-07-24  0:17   ` Sean Christopherson
@ 2020-07-24 14:24   ` Ingo Molnar
  2020-07-24 19:08     ` Thomas Gleixner
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2020-07-24 14:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson


* Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Use the generic infrastructure to check for and handle pending work before
> transitioning into guest mode.
> 
> This now handles TIF_NOTIFY_RESUME as well, which was ignored so
> far. Handling it is important as this covers task work, and task work will
> be used to offload the heavy lifting of POSIX CPU timers to thread context.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V5: Rename exit -> xfer
> ---
>  arch/x86/kvm/Kconfig   |    1 +
>  arch/x86/kvm/vmx/vmx.c |   11 +++++------
>  arch/x86/kvm/x86.c     |   15 ++++++---------
>  3 files changed, 12 insertions(+), 15 deletions(-)
> 
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -42,6 +42,7 @@ config KVM
>  	select HAVE_KVM_MSI
>  	select HAVE_KVM_CPU_RELAX_INTERCEPT
>  	select HAVE_KVM_NO_POLL
> +	select KVM_XFER_TO_GUEST_WORK
>  	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
>  	select KVM_VFIO
>  	select SRCU
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -27,6 +27,7 @@
>  #include <linux/slab.h>
>  #include <linux/tboot.h>
>  #include <linux/trace_events.h>
> +#include <linux/entry-kvm.h>
>  
>  #include <asm/apic.h>
>  #include <asm/asm.h>
> @@ -5376,14 +5377,12 @@ static int handle_invalid_guest_state(st

>  
>  	return 1;
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -56,6 +56,7 @@
>  #include <linux/sched/stat.h>
>  #include <linux/sched/isolation.h>
>  #include <linux/mem_encrypt.h>
> +#include <linux/entry-kvm.h>
>  
>  #include <trace/events/kvm.h>
>  
> @@ -1585,7 +1586,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
>  bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
> -		need_resched() || signal_pending(current);
> +		xfer_to_guest_mode_work_pending();
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_exit_request);
>  
> @@ -8676,15 +8677,11 @@ static int vcpu_run(struct kvm_vcpu *vcp
>  			break;
>  		}
>  
> -		if (signal_pending(current)) {
> -			r = -EINTR;
> -			vcpu->run->exit_reason = KVM_EXIT_INTR;
> -			++vcpu->stat.signal_exits;
> -			break;
> -		}
> -		if (need_resched()) {
> +		if (xfer_to_guest_mode_work_pending()) {
>  			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> -			cond_resched();
> +			r = xfer_to_guest_mode(vcpu);
> +			if (r)
> +				return r;
>  			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>  		}
>  	}

So this chunk replaces vcpu_run()'s cond_resched() call with a call to 
xfer_to_guest_mode(), which checks NEED_RESCHED & acts upon it, among 
other things.

But:

>  		}
>  
>  		/*
> -		 * Note, return 1 and not 0, vcpu_run() is responsible for
> -		 * morphing the pending signal into the proper return code.
> +		 * Note, return 1 and not 0, vcpu_run() will invoke
> +		 * xfer_to_guest_mode() which will create a proper return
> +		 * code.
>  		 */
> -		if (signal_pending(current))
> +		if (__xfer_to_guest_mode_work_pending())
>  			return 1;
> -
> -		if (need_resched())
> -			schedule();
>  	}

AFAICS this chunk removes a conditional reschedule point from 
handle_invalid_guest_state() and replaces it with 
__xfer_to_guest_mode_work_pending().

But __xfer_to_guest_mode_work_pending() doesn't do the cond-resched of 
the full xfer_to_guest_mode_work() function - so we essentially lose a 
cond_resched() here.

Is this side effect intended, was the cond_resched() superfluous?

The logic is quite a maze, and the KVM code was missing a few of the 
state checks, so maybe this is all for the better - just wanted to 
mention it, in case it matters.

Thanks,

	Ingo


* Re: [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code
  2020-07-22 22:00 ` [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code Thomas Gleixner
@ 2020-07-24 14:28   ` Ingo Molnar
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2020-07-24 14:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson


* Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Replace the x86 code with the generic variant. Use temporary defines for
> idtentry_* which will be cleaned up in the next step.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

There was a comment that still referenced the old x86-specific API 
names - the patch below updates them.

Thanks,

	Ingo

====
 arch/x86/include/asm/idtentry.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 73eb277e63ab..a9fb57f06d39 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -128,7 +128,7 @@ static __always_inline void __##func(struct pt_regs *regs,		\
  * body with a pair of curly brackets.
  *
  * Contrary to DEFINE_IDTENTRY() this does not invoke the
- * idtentry_enter/exit() helpers before and after the body invocation. This
+ * irqentry_enter/exit() helpers before and after the body invocation. This
  * needs to be done in the body itself if applicable. Use if extra work
  * is required before the enter/exit() helpers are invoked.
  */



* Re: [patch V5 15/15] x86/kvm: Use generic xfer to guest work function
  2020-07-24 14:24   ` Ingo Molnar
@ 2020-07-24 19:08     ` Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

Ingo Molnar <mingo@kernel.org> writes:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>>  		/*
>> -		 * Note, return 1 and not 0, vcpu_run() is responsible for
>> -		 * morphing the pending signal into the proper return code.
>> +		 * Note, return 1 and not 0, vcpu_run() will invoke
>> +		 * xfer_to_guest_mode() which will create a proper return
>> +		 * code.
>>  		 */
>> -		if (signal_pending(current))
>> +		if (__xfer_to_guest_mode_work_pending())
>>  			return 1;
>> -
>> -		if (need_resched())
>> -			schedule();
>>  	}
>
> AFAICS this chunk removes a conditional reschedule point from 
> handle_invalid_guest_state() and replaces it with 
> __xfer_to_guest_mode_work_pending().
>
> But __xfer_to_guest_mode_work_pending() doesn't do the cond-resched of 
> the full xfer_to_guest_mode_work() function - so we essentially lose a 
> cond_resched() here.
>
> Is this side effect intended, was the cond_resched() superfluous?

It makes the thing drop back to the outer loop for any pending work, not
only for signals. That avoids having yet another thing to worry about.

Thanks,

        tglx
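
Put as a sketch: the reschedule is not lost, it moves one loop out. The
emulation loop only checks and bails,

	/* handle_invalid_guest_state(): inner emulation loop */
	if (__xfer_to_guest_mode_work_pending())
		return 1;	/* drop back to vcpu_run() */

and vcpu_run() then calls xfer_to_guest_mode_handle_work(), whose work
loop contains

	if (ti_work & _TIF_NEED_RESCHED)
		schedule();

which covers the schedule() removed from the emulation loop.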


* [tip: core/entry] entry: Provide infrastructure for work before transitioning to guest mode
  2020-07-22 21:59 ` [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode Thomas Gleixner
@ 2020-07-24 19:08   ` tip-bot2 for Thomas Gleixner
  2020-07-29 16:55   ` [patch V5 05/15] " Qian Cai
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     935ace2fb5cc49ae88bd1f1735ddc51cdc2ebfb3
Gitweb:        https://git.kernel.org/tip/935ace2fb5cc49ae88bd1f1735ddc51cdc2ebfb3
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 22 Jul 2020 23:59:59 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:03:42 +02:00

entry: Provide infrastructure for work before transitioning to guest mode

Entering a guest is similar to exiting to user space. Pending work like
handling signals, rescheduling, task work etc. needs to be handled before
that.

Provide generic infrastructure to avoid duplication of the same handling
code all over the place.

The transfer to guest mode handling is different from the exit to usermode
handling, e.g. vs. rseq and live patching, so a separate function is used.

The initial list of work items handled is:

    TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME

Architecture specific TIF flags can be added via defines in the
architecture specific include files.

The calling convention is also different from the syscall/interrupt entry
functions as KVM invokes this from the outer vcpu_run() loop with
interrupts and preemption enabled. To prevent missing a pending work item
it invokes a check for pending TIF work from interrupt disabled code right
before transitioning to guest mode. The lockdep, RCU and tracing state
handling is also done directly around the switch to and from guest mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de
---
 include/linux/entry-kvm.h | 80 ++++++++++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h  |  8 ++++-
 kernel/entry/Makefile     |  3 +-
 kernel/entry/kvm.c        | 51 ++++++++++++++++++++++++-
 virt/kvm/Kconfig          |  3 +-
 5 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/entry-kvm.h
 create mode 100644 kernel/entry/kvm.c

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
new file mode 100644
index 0000000..0cef17a
--- /dev/null
+++ b/include/linux/entry-kvm.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYKVM_H
+#define __LINUX_ENTRYKVM_H
+
+#include <linux/entry-common.h>
+
+/* Transfer to guest mode work */
+#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
+
+#ifndef ARCH_XFER_TO_GUEST_MODE_WORK
+# define ARCH_XFER_TO_GUEST_MODE_WORK	(0)
+#endif
+
+#define XFER_TO_GUEST_MODE_WORK					\
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |			\
+	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+
+struct kvm_vcpu;
+
+/**
+ * arch_xfer_to_guest_mode_handle_work - Architecture specific xfer to guest
+ *					 mode work handling function.
+ * @vcpu:	Pointer to current's VCPU data
+ * @ti_work:	Cached TIF flags gathered in xfer_to_guest_mode_handle_work()
+ *
+ * Invoked from xfer_to_guest_mode_handle_work(). Defaults to NOOP. Can be
+ * replaced by architecture specific code.
+ */
+static inline int arch_xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu,
+						      unsigned long ti_work);
+
+#ifndef arch_xfer_to_guest_mode_work
+static inline int arch_xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu,
+						      unsigned long ti_work)
+{
+	return 0;
+}
+#endif
+
+/**
+ * xfer_to_guest_mode_handle_work - Check and handle pending work which needs
+ *				    to be handled before going to guest mode
+ * @vcpu:	Pointer to current's VCPU data
+ *
+ * Returns: 0 or an error code
+ */
+int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu);
+
+/**
+ * __xfer_to_guest_mode_work_pending - Check if work is pending
+ *
+ * Returns: True if work pending, False otherwise.
+ *
+ * Bare variant of xfer_to_guest_mode_work_pending(). Can be called from
+ * interrupt enabled code for racy quick checks with care.
+ */
+static inline bool __xfer_to_guest_mode_work_pending(void)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	return !!(ti_work & XFER_TO_GUEST_MODE_WORK);
+}
+
+/**
+ * xfer_to_guest_mode_work_pending - Check if work is pending which needs to be
+ *				     handled before returning to guest mode
+ *
+ * Returns: True if work pending, False otherwise.
+ *
+ * Has to be invoked with interrupts disabled before the transition to
+ * guest mode.
+ */
+static inline bool xfer_to_guest_mode_work_pending(void)
+{
+	lockdep_assert_irqs_disabled();
+	return __xfer_to_guest_mode_work_pending();
+}
+#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d564855..ac83e9c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1439,4 +1439,12 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
+static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_INTR;
+	vcpu->stat.signal_exits++;
+}
+#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
 #endif
diff --git a/kernel/entry/Makefile b/kernel/entry/Makefile
index c207d20..34c8a3f 100644
--- a/kernel/entry/Makefile
+++ b/kernel/entry/Makefile
@@ -9,4 +9,5 @@ KCOV_INSTRUMENT := n
 CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
 CFLAGS_common.o		+= -fno-stack-protector
 
-obj-$(CONFIG_GENERIC_ENTRY) += common.o
+obj-$(CONFIG_GENERIC_ENTRY) 		+= common.o
+obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK)	+= kvm.o
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
new file mode 100644
index 0000000..eb1a8a4
--- /dev/null
+++ b/kernel/entry/kvm.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/entry-kvm.h>
+#include <linux/kvm_host.h>
+
+static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
+{
+	do {
+		int ret;
+
+		if (ti_work & _TIF_SIGPENDING) {
+			kvm_handle_signal_exit(vcpu);
+			return -EINTR;
+		}
+
+		if (ti_work & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(NULL);
+		}
+
+		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
+		if (ret)
+			return ret;
+
+		ti_work = READ_ONCE(current_thread_info()->flags);
+	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+	return 0;
+}
+
+int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
+{
+	unsigned long ti_work;
+
+	/*
+	 * This is invoked from the outer guest loop with interrupts and
+	 * preemption enabled.
+	 *
+	 * KVM invokes xfer_to_guest_mode_work_pending() with interrupts
+	 * disabled in the inner loop before going into guest mode. No need
+	 * to disable interrupts here.
+	 */
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (!(ti_work & XFER_TO_GUEST_MODE_WORK))
+		return 0;
+
+	return xfer_to_guest_mode_work(vcpu, ti_work);
+}
+EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index aad9284..1c37ccd 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -60,3 +60,6 @@ config HAVE_KVM_VCPU_RUN_PID_CHANGE
 
 config HAVE_KVM_NO_POLL
        bool
+
+config KVM_XFER_TO_GUEST_WORK
+       bool
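
For illustration, an architecture hooks into this roughly as sketched
below. _TIF_ARCH_WORK and handle_arch_work() are made-up names; the
override convention follows the #ifndef arch_xfer_to_guest_mode_work
guard in the header above:

/* Arch header, included before <linux/entry-kvm.h>: extra TIF bits */
#define ARCH_XFER_TO_GUEST_MODE_WORK	_TIF_ARCH_WORK

/* Replace the default NOOP work handler */
#define arch_xfer_to_guest_mode_work	arch_xfer_to_guest_mode_work
static inline int arch_xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu,
						      unsigned long ti_work)
{
	if (ti_work & _TIF_ARCH_WORK)
		handle_arch_work(vcpu);	/* hypothetical helper */

	/* Non-zero aborts the transition and is propagated to KVM */
	return 0;
}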


* [tip: core/entry] entry: Provide generic syscall exit function
  2020-07-22 21:59 ` [patch V5 03/15] entry: Provide generic syscall exit function Thomas Gleixner
@ 2020-07-24 19:08   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     a9f3a74a29af095f3e1b89e9176f8127912ae0f0
Gitweb:        https://git.kernel.org/tip/a9f3a74a29af095f3e1b89e9176f8127912ae0f0
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 22 Jul 2020 23:59:57 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 14:59:03 +02:00

entry: Provide generic syscall exit function

Like syscall entry all architectures have similar and pointlessly different
code to handle pending work before returning from a syscall to user space.

  1) One-time syscall exit work:
      - rseq syscall exit
      - audit
      - syscall tracing
      - tracehook (single stepping)

  2) Preparatory work
      - Exit to user mode loop (common TIF handling).
      - Architecture specific one time work arch_exit_to_user_mode_prepare()
      - Address limit and lockdep checks
     
  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
     arch_exit_to_user_mode() to handle e.g. speculation mitigations

Provide a generic version based on the x86 code which has all the RCU and
instrumentation protections right.

Provide a variant for interrupt return to user mode as well which shares
the above #2 and #3 work items.

After syscall_exit_to_user_mode() and irqentry_exit_to_user_mode() the
architecture code just has to return to user space. The code after
returning from these functions must not be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.613977173@linutronix.de


---
 include/linux/entry-common.h | 189 ++++++++++++++++++++++++++++++++++-
 kernel/entry/common.c        | 169 ++++++++++++++++++++++++++++++-
 2 files changed, 358 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 42fc8e4..c4a57be 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -29,6 +29,14 @@
 # define _TIF_SYSCALL_AUDIT		(0)
 #endif
 
+#ifndef _TIF_PATCH_PENDING
+# define _TIF_PATCH_PENDING		(0)
+#endif
+
+#ifndef _TIF_UPROBE
+# define _TIF_UPROBE			(0)
+#endif
+
 /*
 * TIF flags handled in syscall_enter_from_user_mode()
  */
@@ -41,6 +49,29 @@
 	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
 	 ARCH_SYSCALL_ENTER_WORK)
 
+/*
+ * TIF flags handled in syscall_exit_to_user_mode()
+ */
+#ifndef ARCH_SYSCALL_EXIT_WORK
+# define ARCH_SYSCALL_EXIT_WORK		(0)
+#endif
+
+#define SYSCALL_EXIT_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |			\
+	 _TIF_SYSCALL_TRACEPOINT | ARCH_SYSCALL_EXIT_WORK)
+
+/*
+ * TIF flags handled in exit_to_user_mode_loop()
+ */
+#ifndef ARCH_EXIT_TO_USER_MODE_WORK
+# define ARCH_EXIT_TO_USER_MODE_WORK		(0)
+#endif
+
+#define EXIT_TO_USER_MODE_WORK						\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 ARCH_EXIT_TO_USER_MODE_WORK)
+
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
 * @regs:	Pointer to current's pt_regs
@@ -106,6 +137,149 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
 
 /**
+ * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Defaults to local_irq_enable(). Can be supplied by architecture specific
+ * code.
+ */
+static inline void local_irq_enable_exit_to_user(unsigned long ti_work);
+
+#ifndef local_irq_enable_exit_to_user
+static inline void local_irq_enable_exit_to_user(unsigned long ti_work)
+{
+	local_irq_enable();
+}
+#endif
+
+/**
+ * local_irq_disable_exit_to_user - Exit to user variant of local_irq_disable()
+ *
+ * Defaults to local_irq_disable(). Can be supplied by architecture specific
+ * code.
+ */
+static inline void local_irq_disable_exit_to_user(void);
+
+#ifndef local_irq_disable_exit_to_user
+static inline void local_irq_disable_exit_to_user(void)
+{
+	local_irq_disable();
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode_work - Architecture specific TIF work for exit
+ *				 to user mode.
+ * @regs:	Pointer to current's pt_regs
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Invoked from exit_to_user_mode_loop() with interrupt enabled
+ *
+ * Defaults to NOOP. Can be supplied by architecture specific code.
+ */
+static inline void arch_exit_to_user_mode_work(struct pt_regs *regs,
+					       unsigned long ti_work);
+
+#ifndef arch_exit_to_user_mode_work
+static inline void arch_exit_to_user_mode_work(struct pt_regs *regs,
+					       unsigned long ti_work)
+{
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode_prepare - Architecture specific preparation for
+ *				    exit to user mode.
+ * @regs:	Pointer to current's pt_regs
+ * @ti_work:	Cached TIF flags gathered with interrupts disabled
+ *
+ * Invoked from exit_to_user_mode_prepare() with interrupt disabled as the last
+ * function before return. Defaults to NOOP.
+ */
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work);
+
+#ifndef arch_exit_to_user_mode_prepare
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work)
+{
+}
+#endif
+
+/**
+ * arch_exit_to_user_mode - Architecture specific final work before
+ *			    exit to user mode.
+ *
+ * Invoked from exit_to_user_mode() with interrupt disabled as the last
+ * function before return. Defaults to NOOP.
+ *
+ * This needs to be __always_inline because it is non-instrumentable code
+ * invoked after context tracking switched to user mode.
+ *
+ * An architecture implementation must not do anything complex, no locking
+ * etc. The main purpose is for speculation mitigations.
+ */
+static __always_inline void arch_exit_to_user_mode(void);
+
+#ifndef arch_exit_to_user_mode
+static __always_inline void arch_exit_to_user_mode(void) { }
+#endif
+
+/**
+ * arch_do_signal -  Architecture specific signal delivery function
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from exit_to_user_mode_loop().
+ */
+void arch_do_signal(struct pt_regs *regs);
+
+/**
+ * arch_syscall_exit_tracehook - Wrapper around tracehook_report_syscall_exit()
+ * @regs:	Pointer to current's pt_regs
+ * @step:	Indicator for single step
+ *
+ * Defaults to tracehook_report_syscall_exit(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_exit_to_user_mode()
+ */
+static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step);
+
+#ifndef arch_syscall_exit_tracehook
+static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
+{
+	tracehook_report_syscall_exit(regs, step);
+}
+#endif
+
+/**
+ * syscall_exit_to_user_mode - Handle work before returning to user mode
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked with interrupts enabled and fully valid regs. Returns with all
+ * work handled, interrupts disabled such that the caller can immediately
+ * switch to user mode. Called from architecture specific syscall and ret
+ * from fork code.
+ *
+ * The call order is:
+ *  1) One-time syscall exit work:
+ *	- rseq syscall exit
+ *      - audit
+ *	- syscall tracing
+ *	- tracehook (single stepping)
+ *
+ *  2) Preparatory work
+ *	- Exit to user mode loop (common TIF handling). Invokes
+ *	  arch_exit_to_user_mode_work() for architecture specific TIF work
+ *	- Architecture specific one time work arch_exit_to_user_mode_prepare()
+ *	- Address limit and lockdep checks
+ *
+ *  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
+ *     arch_exit_to_user_mode() to handle e.g. speculation mitigations
+ */
+void syscall_exit_to_user_mode(struct pt_regs *regs);
+
+/**
  * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
 * @regs:	Pointer to current's pt_regs
  *
@@ -118,4 +292,19 @@ long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
  */
 void irqentry_enter_from_user_mode(struct pt_regs *regs);
 
+/**
+ * irqentry_exit_to_user_mode - Interrupt exit work
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked with interrupts disabled and fully valid regs. Returns with all
+ * work handled, interrupts disabled such that the caller can immediately
+ * switch to user mode. Called from architecture specific interrupt
+ * handling code.
+ *
+ * The call order is #2 and #3 as described in syscall_exit_to_user_mode().
+ * Interrupt exit is not invoking #1 which is the syscall specific one time
+ * work.
+ */
+void irqentry_exit_to_user_mode(struct pt_regs *regs);
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 1d636de..0a051bb 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,8 @@
 
 #include <linux/context_tracking.h>
 #include <linux/entry-common.h>
+#include <linux/livepatch.h>
+#include <linux/audit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -82,7 +84,174 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 	return syscall;
 }
 
+/**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall/interrupt exit enables interrupts, but the kernel state is
+ * interrupts disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Invoke architecture specific last minute exit code, e.g. speculation
+ *    mitigations, etc.
+ * 4) Tell lockdep that interrupts are enabled
+ */
+static __always_inline void exit_to_user_mode(void)
+{
+	instrumentation_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instrumentation_end();
+
+	user_enter_irqoff();
+	arch_exit_to_user_mode();
+	lockdep_hardirqs_on(CALLER_ADDR0);
+}
+
+/* Workaround to allow gradual conversion of architecture code */
+void __weak arch_do_signal(struct pt_regs *regs) { }
+
+static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+					    unsigned long ti_work)
+{
+	/*
+	 * Before returning to user space ensure that all pending work
+	 * items have been completed.
+	 */
+	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+
+		local_irq_enable_exit_to_user(ti_work);
+
+		if (ti_work & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (ti_work & _TIF_UPROBE)
+			uprobe_notify_resume(regs);
+
+		if (ti_work & _TIF_PATCH_PENDING)
+			klp_update_patch_state(current);
+
+		if (ti_work & _TIF_SIGPENDING)
+			arch_do_signal(regs);
+
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+			rseq_handle_notify_resume(NULL, regs);
+		}
+
+		/* Architecture specific TIF work */
+		arch_exit_to_user_mode_work(regs, ti_work);
+
+		/*
+		 * Disable interrupts and reevaluate the work flags as they
+		 * might have changed while interrupts and preemption was
+		 * enabled above.
+		 */
+		local_irq_disable_exit_to_user();
+		ti_work = READ_ONCE(current_thread_info()->flags);
+	}
+
+	/* Return the latest work state for arch_exit_to_user_mode() */
+	return ti_work;
+}
+
+static void exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+	lockdep_assert_irqs_disabled();
+
+	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+		ti_work = exit_to_user_mode_loop(regs, ti_work);
+
+	arch_exit_to_user_mode_prepare(regs, ti_work);
+
+	/* Ensure that the address limit is intact and no locks are held */
+	addr_limit_user_check();
+	lockdep_assert_irqs_disabled();
+	lockdep_sys_exit();
+}
+
+#ifndef _TIF_SINGLESTEP
+static inline bool report_single_step(unsigned long ti_work)
+{
+	return false;
+}
+#else
+/*
+ * If TIF_SYSCALL_EMU is set, then the only reason to report is when
+ * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
+ * instruction has already been reported in syscall_enter_from_user_mode().
+ */
+#define SYSEMU_STEP	(_TIF_SINGLESTEP | _TIF_SYSCALL_EMU)
+
+static inline bool report_single_step(unsigned long ti_work)
+{
+	return (ti_work & SYSEMU_STEP) == _TIF_SINGLESTEP;
+}
+#endif
+
+static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work)
+{
+	bool step;
+
+	audit_syscall_exit(regs);
+
+	if (ti_work & _TIF_SYSCALL_TRACEPOINT)
+		trace_sys_exit(regs, syscall_get_return_value(current, regs));
+
+	step = report_single_step(ti_work);
+	if (step || ti_work & _TIF_SYSCALL_TRACE)
+		arch_syscall_exit_tracehook(regs, step);
+}
+
+/*
+ * Syscall specific exit to user mode preparation. Runs with interrupts
+ * enabled.
+ */
+static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	u32 cached_flags = READ_ONCE(current_thread_info()->flags);
+	unsigned long nr = syscall_get_nr(current, regs);
+
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
+	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
+		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
+			local_irq_enable();
+	}
+
+	rseq_syscall(regs);
+
+	/*
+	 * Do one-time syscall specific work. If these work items are
+	 * enabled, we want to run them exactly once per syscall exit with
+	 * interrupts enabled.
+	 */
+	if (unlikely(cached_flags & SYSCALL_EXIT_WORK))
+		syscall_exit_work(regs, cached_flags);
+}
+
+__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	syscall_exit_to_user_mode_prepare(regs);
+	local_irq_disable_exit_to_user();
+	exit_to_user_mode_prepare(regs);
+	instrumentation_end();
+	exit_to_user_mode();
+}
+
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	enter_from_user_mode(regs);
 }
+
+noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	exit_to_user_mode_prepare(regs);
+	instrumentation_end();
+	exit_to_user_mode();
+}
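
From the architecture side the two exit variants slot in as below; the
arch function names are hypothetical:

/* Syscall return: runs work items #1, #2 and #3 */
__visible noinstr void arch_syscall_return(struct pt_regs *regs)
{
	/* ... syscall dispatched, interrupts enabled ... */
	syscall_exit_to_user_mode(regs);
	/* IRQs off, state final. Hand over to the ASM return path;
	 * nothing after this call may be instrumentable. */
}

/* Interrupt return to user mode: runs only #2 and #3 */
__visible noinstr void arch_irq_return_to_user(struct pt_regs *regs)
{
	/* ... interrupt handled, interrupts disabled ... */
	irqentry_exit_to_user_mode(regs);
}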


* [tip: core/entry] entry: Provide generic interrupt entry/exit code
  2020-07-22 21:59 ` [patch V5 04/15] entry: Provide generic interrupt entry/exit code Thomas Gleixner
@ 2020-07-24 19:08   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     a5497bab5f72dce38a259a53fd3ac1239a7ecf40
Gitweb:        https://git.kernel.org/tip/a5497bab5f72dce38a259a53fd3ac1239a7ecf40
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 22 Jul 2020 23:59:58 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 14:59:04 +02:00

entry: Provide generic interrupt entry/exit code

Like the syscall entry/exit code, interrupt/exception entry after the real
low level ASM bits should not be different across architectures.

Provide a generic version based on the x86 code.

irqentry_enter() is called after the low level entry code and
irqentry_exit() must be invoked right before returning to the low level
code which just contains the actual return logic. The code before
irqentry_enter() and irqentry_exit() must not be instrumented. Code after
irqentry_enter() and before irqentry_exit() can be instrumented.

irqentry_enter() invokes irqentry_enter_from_user_mode() if the
interrupt/exception came from user mode. If it entered from kernel mode it
handles the kernel mode variant of establishing state for lockdep, RCU and
tracing depending on the kernel context it interrupted (idle, non-idle).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220519.723703209@linutronix.de


---
 include/linux/entry-common.h |  62 ++++++++++++++++++-
 kernel/entry/common.c        | 117 ++++++++++++++++++++++++++++++++++-
 2 files changed, 179 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c4a57be..efebbff 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -307,4 +307,66 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
  */
 void irqentry_exit_to_user_mode(struct pt_regs *regs);
 
+#ifndef irqentry_state
+typedef struct irqentry_state {
+	bool	exit_rcu;
+} irqentry_state_t;
+#endif
+
+/**
+ * irqentry_enter - Handle state tracking on ordinary interrupt entries
+ * @regs:	Pointer to pt_regs of interrupted context
+ *
+ * Invokes:
+ *  - lockdep irqflag state tracking as low level ASM entry disabled
+ *    interrupts.
+ *
+ *  - Context tracking if the exception hit user mode.
+ *
+ *  - The hardirq tracer to keep the state consistent as low level ASM
+ *    entry disabled interrupts.
+ *
+ * As a precondition, this requires that the entry came from user mode,
+ * idle, or a kernel context in which RCU is watching.
+ *
+ * For kernel mode entries RCU handling is done conditionally. If RCU is
+ * watching then the only RCU requirement is to check whether the tick has
+ * to be restarted. If RCU is not watching then rcu_irq_enter() has to be
+ * invoked on entry and rcu_irq_exit() on exit.
+ *
+ * Avoiding the rcu_irq_enter/exit() calls is an optimization but also
+ * solves the problem of kernel mode pagefaults which can schedule, which
+ * is not possible after invoking rcu_irq_enter() without undoing it.
+ *
+ * For user mode entries irqentry_enter_from_user_mode() is invoked to
+ * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
+ * would not be possible.
+ *
+ * Returns: An opaque object that must be passed to irqentry_exit()
+ */
+irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ *
+ * Conditional reschedule with additional sanity checks.
+ */
+void irqentry_exit_cond_resched(void);
+
+/**
+ * irqentry_exit - Handle return from exception that used irqentry_enter()
+ * @regs:	Pointer to pt_regs (exception entry regs)
+ * @state:	Return value from matching call to irqentry_enter()
+ *
+ * Depending on the return target (kernel/user) this runs the necessary
+ * preemption and work checks if possible and required and returns to
+ * the caller with interrupts disabled and no further work pending.
+ *
+ * This is the last action before returning to the low level ASM code which
+ * just needs to return to the appropriate context.
+ *
+ * Counterpart to irqentry_enter().
+ */
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 0a051bb..495f5c0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -255,3 +255,120 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	instrumentation_end();
 	exit_to_user_mode();
 }
+
+irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs)
+{
+	irqentry_state_t ret = {
+		.exit_rcu = false,
+	};
+
+	if (user_mode(regs)) {
+		irqentry_enter_from_user_mode(regs);
+		return ret;
+	}
+
+	/*
+	 * If this entry hit the idle task invoke rcu_irq_enter() whether
+	 * RCU is watching or not.
+	 *
+	 * Interrupts can nest when the first interrupt invokes softirq
+	 * processing on return which enables interrupts.
+	 *
+	 * Scheduler ticks in the idle task can mark quiescent state and
+	 * terminate a grace period, if and only if the timer interrupt is
+	 * not nested into another interrupt.
+	 *
+	 * Checking for __rcu_is_watching() here would prevent the nesting
+	 * interrupt from invoking rcu_irq_enter(). If that nested interrupt
+	 * is the tick then rcu_flavor_sched_clock_irq() would wrongfully
+	 * assume that it is the first interrupt and eventually claim
+	 * quiescent state and end grace periods prematurely.
+	 *
+	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
+	 * consistent.
+	 *
+	 * TINY_RCU does not support EQS, so let the compiler eliminate
+	 * this part when enabled.
+	 */
+	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
+		/*
+		 * If RCU is not watching then the same careful
+		 * sequence vs. lockdep and tracing is required
+		 * as in irqentry_enter_from_user_mode().
+		 */
+		lockdep_hardirqs_off(CALLER_ADDR0);
+		rcu_irq_enter();
+		instrumentation_begin();
+		trace_hardirqs_off_finish();
+		instrumentation_end();
+
+		ret.exit_rcu = true;
+		return ret;
+	}
+
+	/*
+	 * If RCU is watching then RCU only wants to check whether it needs
+	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
+	 * already contains a warning when RCU is not watching, so no point
+	 * in having another one here.
+	 */
+	instrumentation_begin();
+	rcu_irq_enter_check_tick();
+	/* Use the combo lockdep/tracing function */
+	trace_hardirqs_off();
+	instrumentation_end();
+
+	return ret;
+}
+
+void irqentry_exit_cond_resched(void)
+{
+	if (!preempt_count()) {
+		/* Sanity check RCU and thread stack */
+		rcu_irq_exit_check_preempt();
+		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
+			WARN_ON_ONCE(!on_thread_stack());
+		if (need_resched())
+			preempt_schedule_irq();
+	}
+}
+
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+{
+	lockdep_assert_irqs_disabled();
+
+	/* Check whether this returns to user mode */
+	if (user_mode(regs)) {
+		irqentry_exit_to_user_mode(regs);
+	} else if (!regs_irqs_disabled(regs)) {
+		/*
+		 * If RCU was not watching on entry this needs to be done
+		 * carefully and needs the same ordering of lockdep/tracing
+		 * and RCU as the return to user mode path.
+		 */
+		if (state.exit_rcu) {
+			instrumentation_begin();
+			/* Tell the tracer that IRET will enable interrupts */
+			trace_hardirqs_on_prepare();
+			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+			instrumentation_end();
+			rcu_irq_exit();
+			lockdep_hardirqs_on(CALLER_ADDR0);
+			return;
+		}
+
+		instrumentation_begin();
+		if (IS_ENABLED(CONFIG_PREEMPTION))
+			irqentry_exit_cond_resched();
+		/* Covers both tracing and lockdep */
+		trace_hardirqs_on();
+		instrumentation_end();
+	} else {
+		/*
+		 * IRQ flags state is correct already. Just tell RCU if it
+		 * was not watching on entry.
+		 */
+		if (state.exit_rcu)
+			rcu_irq_exit();
+	}
+}
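
The intended use from an architecture's interrupt entry point is the
bracket pattern below; handle_arch_irq() stands in for the real handler:

__visible noinstr void arch_interrupt_handler(struct pt_regs *regs)
{
	irqentry_state_t state = irqentry_enter(regs);

	/* Only code between enter() and exit() may be instrumented */
	instrumentation_begin();
	handle_arch_irq(regs);		/* placeholder */
	instrumentation_end();

	irqentry_exit(regs, state);
}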


* [tip: core/entry] entry: Provide generic syscall entry functionality
  2020-07-22 21:59 ` [patch V5 02/15] entry: Provide generic syscall entry functionality Thomas Gleixner
@ 2020-07-24 19:08   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     142781e108b13b2b0e8f035cfb5bfbbc8f14d887
Gitweb:        https://git.kernel.org/tip/142781e108b13b2b0e8f035cfb5bfbbc8f14d887
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 22 Jul 2020 23:59:56 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 14:59:03 +02:00

entry: Provide generic syscall entry functionality

On syscall entry certain work needs to be done:

   - Establish state (lockdep, context tracking, tracing)
   - Conditional work (ptrace, seccomp, audit...)

This code is needlessly duplicated and different in all
architectures.

Provide a generic version based on the x86 implementation which has all the
RCU and instrumentation bits right.

As interrupt/exception entry from user space needs parts of the same
functionality, provide a function for this as well.

syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
called right after the low level ASM entry. The calling code must be
non-instrumentable. After the functions return, state is correct and the
subsequent functions can be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.513463269@linutronix.de

---
 arch/Kconfig                 |   3 +-
 include/linux/entry-common.h | 121 ++++++++++++++++++++++++++++++++++-
 kernel/Makefile              |   1 +-
 kernel/entry/Makefile        |  12 +++-
 kernel/entry/common.c        |  88 +++++++++++++++++++++++++-
 5 files changed, 225 insertions(+)
 create mode 100644 include/linux/entry-common.h
 create mode 100644 kernel/entry/Makefile
 create mode 100644 kernel/entry/common.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc..852a527 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -27,6 +27,9 @@ config HAVE_IMA_KEXEC
 config HOTPLUG_SMT
 	bool
 
+config GENERIC_ENTRY
+       bool
+
 config OPROFILE
 	tristate "OProfile system profiling"
 	depends on PROFILING
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
new file mode 100644
index 0000000..42fc8e4
--- /dev/null
+++ b/include/linux/entry-common.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYCOMMON_H
+#define __LINUX_ENTRYCOMMON_H
+
+#include <linux/tracehook.h>
+#include <linux/syscalls.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+
+#include <asm/entry-common.h>
+
+/*
+ * Define dummy _TIF work flags if not defined by the architecture or for
+ * disabled functionality.
+ */
+#ifndef _TIF_SYSCALL_EMU
+# define _TIF_SYSCALL_EMU		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_TRACEPOINT
+# define _TIF_SYSCALL_TRACEPOINT	(0)
+#endif
+
+#ifndef _TIF_SECCOMP
+# define _TIF_SECCOMP			(0)
+#endif
+
+#ifndef _TIF_SYSCALL_AUDIT
+# define _TIF_SYSCALL_AUDIT		(0)
+#endif
+
+/*
+ * TIF flags handled in syscall_enter_from_user_mode()
+ */
+#ifndef ARCH_SYSCALL_ENTER_WORK
+# define ARCH_SYSCALL_ENTER_WORK	(0)
+#endif
+
+#define SYSCALL_ENTER_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP |	\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
+	 ARCH_SYSCALL_ENTER_WORK)
+
+/**
+ * arch_check_user_regs - Architecture specific sanity check for user mode regs
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Defaults to an empty implementation. Can be replaced by architecture
+ * specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode() in the non-instrumentable
+ * section. Use __always_inline so the compiler cannot push it out of line
+ * and make it instrumentable.
+ */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs);
+
+#ifndef arch_check_user_regs
+static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
+#endif
+
+/**
+ * arch_syscall_enter_tracehook - Wrapper around tracehook_report_syscall_entry()
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: 0 on success or an error code to skip the syscall.
+ *
+ * Defaults to tracehook_report_syscall_entry(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode()
+ */
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_tracehook
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
+{
+	return tracehook_report_syscall_entry(regs);
+}
+#endif
+
+/**
+ * syscall_enter_from_user_mode - Check and handle work before invoking
+ *				 a syscall
+ * @regs:	Pointer to current's pt_regs
+ * @syscall:	The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and the subsequent functions can be
+ * instrumented.
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * If the returned syscall number is -1 then the syscall should be
+ * skipped. In this case the caller may invoke syscall_set_error() or
+ * syscall_set_return_value() first.  If neither of those are called and -1
+ * is returned, then the syscall will fail with ENOSYS.
+ *
+ * The following functionality is handled here:
+ *
+ *  1) Establish state (lockdep, RCU (context tracking), tracing)
+ *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ *     __secure_computing(), trace_sys_enter()
+ *  3) Invocation of audit_syscall_entry()
+ */
+long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+
+/**
+ * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from architecture specific entry code with interrupts disabled.
+ * Can only be called when the interrupt entry came from user mode. The
+ * calling code must be non-instrumentable.  When the function returns all
+ * state is correct and the subsequent functions can be instrumented.
+ *
+ * The function establishes state (lockdep, RCU (context tracking), tracing)
+ */
+void irqentry_enter_from_user_mode(struct pt_regs *regs);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc..fde2000 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -48,6 +48,7 @@ obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
+obj-y += entry/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
diff --git a/kernel/entry/Makefile b/kernel/entry/Makefile
new file mode 100644
index 0000000..c207d20
--- /dev/null
+++ b/kernel/entry/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Prevent the noinstr section from being pestered by sanitizer and other goodies
+# as long as these things cannot be disabled per function.
+KASAN_SANITIZE := n
+UBSAN_SANITIZE := n
+KCOV_INSTRUMENT := n
+
+CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
+CFLAGS_common.o		+= -fno-stack-protector
+
+obj-$(CONFIG_GENERIC_ENTRY) += common.o
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
new file mode 100644
index 0000000..1d636de
--- /dev/null
+++ b/kernel/entry/common.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/context_tracking.h>
+#include <linux/entry-common.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+{
+	arch_check_user_regs(regs);
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
+	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	user_exit_irqoff();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
+}
+
+static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
+{
+	if (unlikely(audit_context())) {
+		unsigned long args[6];
+
+		syscall_get_arguments(current, regs, args);
+		audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]);
+	}
+}
+
+static long syscall_trace_enter(struct pt_regs *regs, long syscall,
+				unsigned long ti_work)
+{
+	long ret = 0;
+
+	/* Handle ptrace */
+	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
+		ret = arch_syscall_enter_tracehook(regs);
+		if (ret || (ti_work & _TIF_SYSCALL_EMU))
+			return -1L;
+	}
+
+	/* Do seccomp after ptrace, to catch any tracer changes. */
+	if (ti_work & _TIF_SECCOMP) {
+		ret = __secure_computing(NULL);
+		if (ret == -1L)
+			return ret;
+	}
+
+	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
+		trace_sys_enter(regs, syscall);
+
+	syscall_enter_audit(regs, syscall);
+
+	return ret ? : syscall;
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+	unsigned long ti_work;
+
+	enter_from_user_mode(regs);
+	instrumentation_begin();
+
+	local_irq_enable();
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (ti_work & SYSCALL_ENTER_WORK)
+		syscall = syscall_trace_enter(regs, syscall, ti_work);
+	instrumentation_end();
+
+	return syscall;
+}
+
+noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+}
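
The x86 conversion later in the series uses this roughly as follows
(sketch; details like the array_index_nospec() clamp are omitted):

__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	/* A returned -1 fails the unsigned range check: syscall skipped */
	if (likely(nr < NR_syscalls))
		regs->ax = sys_call_table[nr](regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
}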


* [tip: core/entry] seccomp: Provide stub for __secure_computing()
  2020-07-22 21:59 ` [patch V5 01/15] seccomp: Provide stub for __secure_computing() Thomas Gleixner
@ 2020-07-24 19:08   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 19:08 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     6823ecabf03031d610a6c5afe7ed4b4fd659a99f
Gitweb:        https://git.kernel.org/tip/6823ecabf03031d610a6c5afe7ed4b4fd659a99f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 22 Jul 2020 23:59:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 14:59:03 +02:00

seccomp: Provide stub for __secure_computing()

To avoid #ifdeffery in the upcoming generic syscall entry work code provide
a stub for __secure_computing() as this is preferred over
secure_computing() because the TIF flag is already evaluated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.404974280@linutronix.de


---
 include/linux/seccomp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4192369..03d28c3 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -61,6 +61,7 @@ struct seccomp_filter { };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 static inline int secure_computing(void) { return 0; }
+static inline int __secure_computing(void) { return 0; }
 #else
 static inline void secure_computing_strict(int this_syscall) { return; }
 #endif


* [tip: x86/entry] x86/kvm: Use generic xfer to guest work function
  2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
  2020-07-24  0:17   ` Sean Christopherson
  2020-07-24 14:24   ` Ingo Molnar
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     72c3c0fe54a3f3ddea8f5ca468ddf9deaf2100b7
Gitweb:        https://git.kernel.org/tip/72c3c0fe54a3f3ddea8f5ca468ddf9deaf2100b7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:09 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:05:01 +02:00

x86/kvm: Use generic xfer to guest work function

Use the generic infrastructure to check for and handle pending work before
transitioning into guest mode.

This now handles TIF_NOTIFY_RESUME as well which was ignored so
far. Handling it is important as this covers task work and task work will
be used to offload the heavy lifting of POSIX CPU timers to thread context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de

---
 arch/x86/kvm/Kconfig   |  1 +
 arch/x86/kvm/vmx/vmx.c | 11 +++++------
 arch/x86/kvm/x86.c     | 15 ++++++---------
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b277a2d..fbd5bd7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -42,6 +42,7 @@ config KVM
 	select HAVE_KVM_MSI
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_NO_POLL
+	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 13745f2..9909375 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/tboot.h>
 #include <linux/trace_events.h>
+#include <linux/entry-kvm.h>
 
 #include <asm/apic.h>
 #include <asm/asm.h>
@@ -5373,14 +5374,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 		}
 
 		/*
-		 * Note, return 1 and not 0, vcpu_run() is responsible for
-		 * morphing the pending signal into the proper return code.
+		 * Note, return 1 and not 0, vcpu_run() will invoke
+		 * xfer_to_guest_mode() which will create a proper return
+		 * code.
 		 */
-		if (signal_pending(current))
+		if (__xfer_to_guest_mode_work_pending())
 			return 1;
-
-		if (need_resched())
-			schedule();
 	}
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f..82d4a9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -56,6 +56,7 @@
 #include <linux/sched/stat.h>
 #include <linux/sched/isolation.h>
 #include <linux/mem_encrypt.h>
+#include <linux/entry-kvm.h>
 
 #include <trace/events/kvm.h>
 
@@ -1587,7 +1588,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
 bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
 {
 	return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
-		need_resched() || signal_pending(current);
+		xfer_to_guest_mode_work_pending();
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_exit_request);
 
@@ -8681,15 +8682,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		if (signal_pending(current)) {
-			r = -EINTR;
-			vcpu->run->exit_reason = KVM_EXIT_INTR;
-			++vcpu->stat.signal_exits;
-			break;
-		}
-		if (need_resched()) {
+		if (xfer_to_guest_mode_work_pending()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-			cond_resched();
+			r = xfer_to_guest_mode_handle_work(vcpu);
+			if (r)
+				return r;
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 		}
 	}
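
For completeness, the signal case as seen from userspace: when
xfer_to_guest_mode_handle_work() returns -EINTR, kvm_handle_signal_exit()
has set KVM_EXIT_INTR, so a VMM sees the usual KVM_RUN semantics. A
sketch with error handling trimmed:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
			/* -EINTR: run->exit_reason == KVM_EXIT_INTR */
			if (errno == EINTR)
				continue;	/* handle the signal, re-enter */
			return;
		}
		/* dispatch on run->exit_reason (KVM_EXIT_IO, MMIO, ...) */
	}
}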


* [tip: x86/entry] x86/entry: Use generic interrupt entry/exit code
  2020-07-22 22:00 ` [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code Thomas Gleixner
  2020-07-24 14:28   ` Ingo Molnar
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     bdcd178ada90d2413bcc9df4211dcdd511a47586
Gitweb:        https://git.kernel.org/tip/bdcd178ada90d2413bcc9df4211dcdd511a47586
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:07 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:05:00 +02:00

x86/entry: Use generic interrupt entry/exit code

Replace the x86 code with the generic variant. Use temporary defines for
idtentry_* which will be cleaned up in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220520.711492752@linutronix.de


---
 arch/x86/entry/common.c         | 167 +-------------------------------
 arch/x86/include/asm/idtentry.h |  10 +--
 2 files changed, 5 insertions(+), 172 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bc96eb8..297e08e 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -198,171 +198,6 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-/**
- * idtentry_enter - Handle state tracking on ordinary idtentries
- * @regs:	Pointer to pt_regs of interrupted context
- *
- * Invokes:
- *  - lockdep irqflag state tracking as low level ASM entry disabled
- *    interrupts.
- *
- *  - Context tracking if the exception hit user mode.
- *
- *  - The hardirq tracer to keep the state consistent as low level ASM
- *    entry disabled interrupts.
- *
- * As a precondition, this requires that the entry came from user mode,
- * idle, or a kernel context in which RCU is watching.
- *
- * For kernel mode entries RCU handling is done conditionally. If RCU is
- * watching then the only RCU requirement is to check whether the tick has
- * to be restarted. If RCU is not watching then rcu_irq_enter() has to be
- * invoked on entry and rcu_irq_exit() on exit.
- *
- * Avoiding the rcu_irq_enter/exit() calls is an optimization but also
- * solves the problem of kernel mode pagefaults which can schedule, which
- * is not possible after invoking rcu_irq_enter() without undoing it.
- *
- * For user mode entries irqentry_enter_from_user_mode() must be invoked to
- * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
- * would not be possible.
- *
- * Returns: An opaque object that must be passed to idtentry_exit()
- *
- * The return value must be fed into the state argument of
- * idtentry_exit().
- */
-idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
-{
-	idtentry_state_t ret = {
-		.exit_rcu = false,
-	};
-
-	if (user_mode(regs)) {
-		irqentry_enter_from_user_mode(regs);
-		return ret;
-	}
-
-	/*
-	 * If this entry hit the idle task invoke rcu_irq_enter() whether
-	 * RCU is watching or not.
-	 *
-	 * Interrupts can nest when the first interrupt invokes softirq
-	 * processing on return which enables interrupts.
-	 *
-	 * Scheduler ticks in the idle task can mark quiescent state and
-	 * terminate a grace period, if and only if the timer interrupt is
-	 * not nested into another interrupt.
-	 *
-	 * Checking for __rcu_is_watching() here would prevent the nesting
-	 * interrupt from invoking rcu_irq_enter(). If that nested interrupt
-	 * is the tick then rcu_flavor_sched_clock_irq() would wrongfully
-	 * assume that it is the first interrupt and eventually claim
-	 * quiescent state and end grace periods prematurely.
-	 *
-	 * Unconditionally invoke rcu_irq_enter() so RCU state stays
-	 * consistent.
-	 *
-	 * TINY_RCU does not support EQS, so let the compiler eliminate
-	 * this part when enabled.
-	 */
-	if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
-		/*
-		 * If RCU is not watching then the same careful
-		 * sequence vs. lockdep and tracing is required
-		 * as in irqentry_enter_from_user_mode().
-		 */
-		lockdep_hardirqs_off(CALLER_ADDR0);
-		rcu_irq_enter();
-		instrumentation_begin();
-		trace_hardirqs_off_finish();
-		instrumentation_end();
-
-		ret.exit_rcu = true;
-		return ret;
-	}
-
-	/*
-	 * If RCU is watching then RCU only wants to check whether it needs
-	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
-	 * already contains a warning when RCU is not watching, so no point
-	 * in having another one here.
-	 */
-	instrumentation_begin();
-	rcu_irq_enter_check_tick();
-	/* Use the combo lockdep/tracing function */
-	trace_hardirqs_off();
-	instrumentation_end();
-
-	return ret;
-}
-
-static void idtentry_exit_cond_resched(struct pt_regs *regs, bool may_sched)
-{
-	if (may_sched && !preempt_count()) {
-		/* Sanity check RCU and thread stack */
-		rcu_irq_exit_check_preempt();
-		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
-			WARN_ON_ONCE(!on_thread_stack());
-		if (need_resched())
-			preempt_schedule_irq();
-	}
-	/* Covers both tracing and lockdep */
-	trace_hardirqs_on();
-}
-
-/**
- * idtentry_exit - Handle return from exception that used idtentry_enter()
- * @regs:	Pointer to pt_regs (exception entry regs)
- * @state:	Return value from matching call to idtentry_enter()
- *
- * Depending on the return target (kernel/user) this runs the necessary
- * preemption and work checks if possible and required and returns to
- * the caller with interrupts disabled and no further work pending.
- *
- * This is the last action before returning to the low level ASM code which
- * just needs to return to the appropriate context.
- *
- * Counterpart to idtentry_enter(). The return value of the entry
- * function must be fed into the @state argument.
- */
-void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
-{
-	lockdep_assert_irqs_disabled();
-
-	/* Check whether this returns to user mode */
-	if (user_mode(regs)) {
-		irqentry_exit_to_user_mode(regs);
-	} else if (regs->flags & X86_EFLAGS_IF) {
-		/*
-		 * If RCU was not watching on entry this needs to be done
-		 * carefully and needs the same ordering of lockdep/tracing
-		 * and RCU as the return to user mode path.
-		 */
-		if (state.exit_rcu) {
-			instrumentation_begin();
-			/* Tell the tracer that IRET will enable interrupts */
-			trace_hardirqs_on_prepare();
-			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-			instrumentation_end();
-			rcu_irq_exit();
-			lockdep_hardirqs_on(CALLER_ADDR0);
-			return;
-		}
-
-		instrumentation_begin();
-		idtentry_exit_cond_resched(regs, IS_ENABLED(CONFIG_PREEMPTION));
-		instrumentation_end();
-	} else {
-		/*
-		 * IRQ flags state is correct already. Just tell RCU if it
-		 * was not watching on entry.
-		 */
-		if (state.exit_rcu)
-			rcu_irq_exit();
-	}
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -427,7 +262,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
 		instrumentation_begin();
-		idtentry_exit_cond_resched(regs, true);
+		irqentry_exit_cond_resched();
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index bf59f72..621e25d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,12 +11,10 @@
 
 #include <asm/irq_stack.h>
 
-typedef struct idtentry_state {
-	bool exit_rcu;
-} idtentry_state_t;
-
-idtentry_state_t idtentry_enter(struct pt_regs *regs);
-void idtentry_exit(struct pt_regs *regs, idtentry_state_t state);
+/* Temporary defines */
+typedef irqentry_state_t idtentry_state_t;
+#define idtentry_enter irqentry_enter
+#define idtentry_exit irqentry_exit
 
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points

^ permalink raw reply related	[flat|nested] 45+ messages in thread
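
For orientation, the shape every handler takes after this change, whether open
coded or generated by the DEFINE_IDTENTRY* macros quoted above, is the
enter/body/exit bracket below. This is a sketch against the generic API;
exc_example is a placeholder name, not something the series adds:

	__visible noinstr void exc_example(struct pt_regs *regs)
	{
		irqentry_state_t state = irqentry_enter(regs);

		instrumentation_begin();
		/* instrumentable handler body runs here */
		instrumentation_end();

		irqentry_exit(regs, state);
	}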

* [tip: x86/entry] x86/entry: Cleanup idtentry_enter/exit
  2020-07-22 22:00 ` [patch V5 14/15] x86/entry: Cleanup idtentry_enter/exit Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     a27a0a55495cdde4b8d98f82460dc46eb44777fd
Gitweb:        https://git.kernel.org/tip/a27a0a55495cdde4b8d98f82460dc46eb44777fd
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:08 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:05:01 +02:00

x86/entry: Cleanup idtentry_enter/exit

Remove the temporary defines and fix up all references.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.855839271@linutronix.de


---
 arch/x86/entry/common.c         |  6 +++---
 arch/x86/include/asm/idtentry.h | 33 +++++++++++++-------------------
 arch/x86/kernel/kvm.c           |  6 +++---
 arch/x86/kernel/traps.c         |  6 +++---
 arch/x86/mm/fault.c             |  6 +++---
 5 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 297e08e..3de0303 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -248,9 +248,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs;
 	bool inhcall;
-	idtentry_state_t state;
+	irqentry_state_t state;
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	old_regs = set_irq_regs(regs);
 
 	instrumentation_begin();
@@ -266,7 +266,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
-		idtentry_exit(regs, state);
+		irqentry_exit(regs, state);
 	}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 621e25d..73eb277 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,11 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-/* Temporary defines */
-typedef irqentry_state_t idtentry_state_t;
-#define idtentry_enter irqentry_enter
-#define idtentry_exit irqentry_exit
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -45,8 +40,8 @@ typedef irqentry_state_t idtentry_state_t;
  * The macro is written so it acts as function definition. Append the
  * body with a pair of curly brackets.
  *
- * idtentry_enter() contains common code which has to be invoked before
- * arbitrary code in the body. idtentry_exit() contains common code
+ * irqentry_enter() contains common code which has to be invoked before
+ * arbitrary code in the body. irqentry_exit() contains common code
  * which has to run before returning to the low level assembly code.
  */
 #define DEFINE_IDTENTRY(func)						\
@@ -54,12 +49,12 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -101,12 +96,12 @@ static __always_inline void __##func(struct pt_regs *regs,		\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__##func (regs, error_code);					\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs,		\
@@ -161,7 +156,7 @@ __visible noinstr void func(struct pt_regs *regs)
  * body with a pair of curly brackets.
  *
  * Contrary to DEFINE_IDTENTRY_ERRORCODE() this does not invoke the
- * idtentry_enter/exit() helpers before and after the body invocation. This
+ * irqentry_enter/exit() helpers before and after the body invocation. This
  * needs to be done in the body itself if applicable. Use if extra work
  * is required before the enter/exit() helpers are invoked.
  */
@@ -197,7 +192,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -205,7 +200,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -229,7 +224,7 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector)
  * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
  * @func:	Function name of the entry point
  *
- * idtentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
+ * irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
  * function body. KVM L1D flush request is set.
  *
  * Runs the function on the interrupt stack if the entry hit kernel mode
@@ -239,7 +234,7 @@ static void __##func(struct pt_regs *regs);				\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
@@ -247,7 +242,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	run_on_irqstack_cond(__##func, regs, regs);			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static noinline void __##func(struct pt_regs *regs)
@@ -268,7 +263,7 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	idtentry_state_t state = idtentry_enter(regs);			\
+	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
@@ -276,7 +271,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	__##func (regs);						\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
-	idtentry_exit(regs, state);					\
+	irqentry_exit(regs, state);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f78482..233c77d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -233,7 +233,7 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
 noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
 	u32 reason = kvm_read_and_reset_apf_flags();
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	switch (reason) {
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
@@ -243,7 +243,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 		return false;
 	}
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	instrumentation_begin();
 
 	/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 	}
 
 	instrumentation_end();
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 	return true;
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 59c7f54..be8fcfe 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 
 DEFINE_IDTENTRY_RAW(exc_invalid_op)
 {
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	/*
 	 * We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	if (!user_mode(regs) && handle_bug(regs))
 		return;
 
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 	instrumentation_begin();
 	handle_invalid_op(regs);
 	instrumentation_end();
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 }
 
 DEFINE_IDTENTRY(exc_coproc_segment_overrun)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5e41949..5e5edd2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1377,7 +1377,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
 	unsigned long address = read_cr2();
-	idtentry_state_t state;
+	irqentry_state_t state;
 
 	prefetchw(&current->mm->mmap_lock);
 
@@ -1412,11 +1412,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	 * code reenabled RCU to avoid subsequent wreckage which helps
 	 * debugability.
 	 */
-	state = idtentry_enter(regs);
+	state = irqentry_enter(regs);
 
 	instrumentation_begin();
 	handle_page_fault(regs, error_code, address);
 	instrumentation_end();
 
-	idtentry_exit(regs, state);
+	irqentry_exit(regs, state);
 }

^ permalink raw reply related	[flat|nested] 45+ messages in thread
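
Once the defines are gone, definition sites are unaffected because they only
use the DEFINE_IDTENTRY* wrappers; the state handling stays hidden inside the
macro. A sketch of a definition site, with the expansion spelled out in the
comment (the handler name is taken from the traps.c hunk above; the body is
illustrative, not quoted from the tree):

	DEFINE_IDTENTRY(exc_coproc_segment_overrun)
	{
		/*
		 * The macro expands to:
		 *   irqentry_state_t state = irqentry_enter(regs);
		 *   instrumentation_begin();
		 *   __exc_coproc_segment_overrun(regs);   <-- this body
		 *   instrumentation_end();
		 *   irqentry_exit(regs, state);
		 */
		do_error_trap(regs, 0, "coprocessor segment overrun",
			      X86_TRAP_OLD_MF, SIGFPE, 0, NULL);
	}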

* [tip: x86/entry] x86/entry: Cleanup idtentry_entry/exit_user
  2020-07-22 22:00 ` [patch V5 12/15] x86/entry: Cleanup idtentry_entry/exit_user Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     517e499227bed34acc69166691e2db5df3dc859a
Gitweb:        https://git.kernel.org/tip/517e499227bed34acc69166691e2db5df3dc859a
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:06 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:05:00 +02:00

x86/entry: Cleanup idtentry_entry/exit_user

Clean up the temporary defines and use irqentry_ instead of idtentry_.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.602603691@linutronix.de


---
 arch/x86/include/asm/idtentry.h |  4 ----
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/traps.c         | 18 +++++++++---------
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index f7d48ea..bf59f72 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,10 +11,6 @@
 
 #include <asm/irq_stack.h>
 
-/* Temporary define */
-#define idtentry_enter_user	irqentry_enter_from_user_mode
-#define idtentry_exit_user	irqentry_exit_to_user_mode
-
 typedef struct idtentry_state {
 	bool exit_rcu;
 } idtentry_state_t;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6d7aa56..97ff831 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1927,11 +1927,11 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
 {
-	idtentry_enter_user(regs);
+	irqentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 	machine_check_vector(regs);
 	instrumentation_end();
-	idtentry_exit_user(regs);
+	irqentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ab6828e..59c7f54 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -638,18 +638,18 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		return;
 
 	/*
-	 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
-	 * can trigger INT3, hence poke_int3_handler() must be done
-	 * before. If the entry came from kernel mode, then use nmi_enter()
-	 * because the INT3 could have been hit in any context including
-	 * NMI.
+	 * irqentry_enter_from_user_mode() uses static_branch_{,un}likely()
+	 * and therefore can trigger INT3, hence poke_int3_handler() must
+	 * be done before. If the entry came from kernel mode, then use
+	 * nmi_enter() because the INT3 could have been hit in any context
+	 * including NMI.
 	 */
 	if (user_mode(regs)) {
-		idtentry_enter_user(regs);
+		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
 		do_int3_user(regs);
 		instrumentation_end();
-		idtentry_exit_user(regs);
+		irqentry_exit_to_user_mode(regs);
 	} else {
 		nmi_enter();
 		instrumentation_begin();
@@ -901,12 +901,12 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
 	 */
 	WARN_ON_ONCE(!user_mode(regs));
 
-	idtentry_enter_user(regs);
+	irqentry_enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	handle_debug(regs, dr6, true);
 	instrumentation_end();
-	idtentry_exit_user(regs);
+	irqentry_exit_to_user_mode(regs);
 }
 
 #ifdef CONFIG_X86_64

^ permalink raw reply related	[flat|nested] 45+ messages in thread
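
Handlers which are statically known to come from user mode skip the mode check
and the kernel-mode RCU handling entirely and pair the two user-mode helpers
directly, as the MCE and debug hunks above do. A sketch of that pattern;
exc_example_user is a placeholder name:

	static __always_inline void exc_example_user(struct pt_regs *regs)
	{
		irqentry_enter_from_user_mode(regs);
		instrumentation_begin();
		/* user-mode-only handling goes here */
		instrumentation_end();
		irqentry_exit_to_user_mode(regs);
	}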

* [tip: x86/entry] x86/entry: Use generic syscall entry function
  2020-07-22 22:00 ` [patch V5 10/15] x86/entry: Use generic syscall entry function Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14
Gitweb:        https://git.kernel.org/tip/27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:04 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:59 +02:00

x86/entry: Use generic syscall entry function

Replace the syscall entry work handling with the generic version. Provide
the necessary helper inlines to handle the real architecture-specific
parts, e.g. ptrace.

Use a temporary define for idtentry_enter_user which will be cleaned up
separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.376213694@linutronix.de


---
 arch/x86/Kconfig                    |   1 +-
 arch/x86/entry/common.c             | 181 +--------------------------
 arch/x86/include/asm/entry-common.h |  32 +++++-
 arch/x86/include/asm/idtentry.h     |   5 +-
 arch/x86/include/asm/thread_info.h  |   5 +-
 5 files changed, 45 insertions(+), 179 deletions(-)
 create mode 100644 arch/x86/include/asm/entry-common.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0a..ccf02e6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -115,6 +115,7 @@ config X86
 	select GENERIC_CPU_AUTOPROBE
 	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_EARLY_IOREMAP
+	select GENERIC_ENTRY
 	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_IOMAP
 	select GENERIC_IRQ_EFFECTIVE_AFF_MASK	if SMP
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9415ae5..d2fe85f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -10,13 +10,13 @@
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
+#include <linux/entry-common.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/errno.h>
 #include <linux/ptrace.h>
 #include <linux/tracehook.h>
 #include <linux/audit.h>
-#include <linux/seccomp.h>
 #include <linux/signal.h>
 #include <linux/export.h>
 #include <linux/context_tracking.h>
@@ -42,70 +42,8 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
-#define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-/* Check that the stack and regs on entry from user mode are sane. */
-static noinstr void check_user_regs(struct pt_regs *regs)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
-		/*
-		 * Make sure that the entry code gave us a sensible EFLAGS
-		 * register.  Native because we want to check the actual CPU
-		 * state, not the interrupt state as imagined by Xen.
-		 */
-		unsigned long flags = native_save_fl();
-		WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
-				      X86_EFLAGS_NT));
-
-		/* We think we came from user mode. Make sure pt_regs agrees. */
-		WARN_ON_ONCE(!user_mode(regs));
-
-		/*
-		 * All entries from user mode (except #DF) should be on the
-		 * normal thread stack and should have user pt_regs in the
-		 * correct location.
-		 */
-		WARN_ON_ONCE(!on_thread_stack());
-		WARN_ON_ONCE(regs != task_pt_regs(current));
-	}
-}
-
-#ifdef CONFIG_CONTEXT_TRACKING
-/**
- * enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall entry disables interrupts, but user mode is traced as interrupts
- * enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- */
-static noinstr void enter_from_user_mode(struct pt_regs *regs)
-{
-	enum ctx_state state = ct_state();
-
-	check_user_regs(regs);
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	user_exit_irqoff();
-
-	instrumentation_begin();
-	CT_WARN_ON(state != CONTEXT_USER);
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-}
-#else
-static __always_inline void enter_from_user_mode(struct pt_regs *regs)
-{
-	check_user_regs(regs);
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-}
-#endif
-
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -129,83 +67,6 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
-	if (arch == AUDIT_ARCH_X86_64) {
-		audit_syscall_entry(regs->orig_ax, regs->di,
-				    regs->si, regs->dx, regs->r10);
-	} else
-#endif
-	{
-		audit_syscall_entry(regs->orig_ax, regs->bx,
-				    regs->cx, regs->dx, regs->si);
-	}
-}
-
-/*
- * Returns the syscall nr to run (which should match regs->orig_ax) or -1
- * to skip the syscall.
- */
-static long syscall_trace_enter(struct pt_regs *regs)
-{
-	u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
-
-	struct thread_info *ti = current_thread_info();
-	unsigned long ret = 0;
-	u32 work;
-
-	work = READ_ONCE(ti->flags);
-
-	if (work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
-		ret = tracehook_report_syscall_entry(regs);
-		if (ret || (work & _TIF_SYSCALL_EMU))
-			return -1L;
-	}
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Do seccomp after ptrace, to catch any tracer changes.
-	 */
-	if (work & _TIF_SECCOMP) {
-		struct seccomp_data sd;
-
-		sd.arch = arch;
-		sd.nr = regs->orig_ax;
-		sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
-		if (arch == AUDIT_ARCH_X86_64) {
-			sd.args[0] = regs->di;
-			sd.args[1] = regs->si;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->r10;
-			sd.args[4] = regs->r8;
-			sd.args[5] = regs->r9;
-		} else
-#endif
-		{
-			sd.args[0] = regs->bx;
-			sd.args[1] = regs->cx;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->si;
-			sd.args[4] = regs->di;
-			sd.args[5] = regs->bp;
-		}
-
-		ret = __secure_computing(&sd);
-		if (ret == -1)
-			return ret;
-	}
-#endif
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->orig_ax);
-
-	do_audit_syscall_entry(regs, arch);
-
-	return ret ?: regs->orig_ax;
-}
-
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
@@ -366,26 +227,10 @@ __visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
 	exit_to_user_mode();
 }
 
-static noinstr long syscall_enter(struct pt_regs *regs, unsigned long nr)
-{
-	struct thread_info *ti;
-
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-
-	local_irq_enable();
-	ti = current_thread_info();
-	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
-		nr = syscall_trace_enter(regs);
-
-	instrumentation_end();
-	return nr;
-}
-
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-	nr = syscall_enter(regs, nr);
+	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
@@ -407,6 +252,8 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 {
+	unsigned int nr = (unsigned int)regs->orig_ax;
+
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
 	/*
@@ -414,7 +261,7 @@ static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 	 * orig_ax, the unsigned int return value truncates it.  This may
 	 * or may not be necessary, but it matches the old asm behavior.
 	 */
-	return syscall_enter(regs, (unsigned int)regs->orig_ax);
+	return (unsigned int)syscall_enter_from_user_mode(regs, nr);
 }
 
 /*
@@ -568,7 +415,7 @@ SYSCALL_DEFINE0(ni_syscall)
  * solves the problem of kernel mode pagefaults which can schedule, which
  * is not possible after invoking rcu_irq_enter() without undoing it.
  *
- * For user mode entries enter_from_user_mode() must be invoked to
+ * For user mode entries irqentry_enter_from_user_mode() must be invoked to
  * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
  * would not be possible.
  *
@@ -584,7 +431,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
 	};
 
 	if (user_mode(regs)) {
-		enter_from_user_mode(regs);
+		irqentry_enter_from_user_mode(regs);
 		return ret;
 	}
 
@@ -615,7 +462,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
 		/*
 		 * If RCU is not watching then the same careful
 		 * sequence vs. lockdep and tracing is required
-		 * as in enter_from_user_mode().
+		 * as in irqentry_enter_from_user_mode().
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
@@ -709,18 +556,6 @@ void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
 }
 
 /**
- * idtentry_enter_user - Handle state tracking on idtentry from user mode
- * @regs:	Pointer to pt_regs of interrupted context
- *
- * Invokes enter_from_user_mode() to establish the proper context for
- * NOHZ_FULL. Otherwise scheduling on exit would not be possible.
- */
-void noinstr idtentry_enter_user(struct pt_regs *regs)
-{
-	enter_from_user_mode(regs);
-}
-
-/**
  * idtentry_exit_user - Handle return from exception to user mode
  * @regs:	Pointer to pt_regs (exception entry regs)
  *
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
new file mode 100644
index 0000000..7070b90
--- /dev/null
+++ b/arch/x86/include/asm/entry-common.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_X86_ENTRY_COMMON_H
+#define _ASM_X86_ENTRY_COMMON_H
+
+/* Check that the stack and regs on entry from user mode are sane. */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+		/*
+		 * Make sure that the entry code gave us a sensible EFLAGS
+		 * register.  Native because we want to check the actual CPU
+		 * state, not the interrupt state as imagined by Xen.
+		 */
+		unsigned long flags = native_save_fl();
+		WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
+				      X86_EFLAGS_NT));
+
+		/* We think we came from user mode. Make sure pt_regs agrees. */
+		WARN_ON_ONCE(!user_mode(regs));
+
+		/*
+		 * All entries from user mode (except #DF) should be on the
+		 * normal thread stack and should have user pt_regs in the
+		 * correct location.
+		 */
+		WARN_ON_ONCE(!on_thread_stack());
+		WARN_ON_ONCE(regs != task_pt_regs(current));
+	}
+}
+#define arch_check_user_regs arch_check_user_regs
+
+#endif
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1bc6f87..449910f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -6,11 +6,14 @@
 #include <asm/trapnr.h>
 
 #ifndef __ASSEMBLY__
+#include <linux/entry-common.h>
 #include <linux/hardirq.h>
 
 #include <asm/irq_stack.h>
 
-void idtentry_enter_user(struct pt_regs *regs);
+/* Temporary define */
+#define idtentry_enter_user	irqentry_enter_from_user_mode
+
 void idtentry_exit_user(struct pt_regs *regs);
 
 typedef struct idtentry_state {
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8cec..267701a 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -133,11 +133,6 @@ struct thread_info {
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 
-/* Work to do before invoking the actual syscall. */
-#define _TIF_WORK_SYSCALL_ENTRY	\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
-
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
 	(_TIF_NOCPUID | _TIF_NOTSC | _TIF_BLOCKSTEP |		\

^ permalink raw reply related	[flat|nested] 45+ messages in thread
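
The trailing '#define arch_check_user_regs arch_check_user_regs' in the new
header is the usual override idiom for the generic entry code: the
architecture header defines the macro to announce that it provides the
function, and <linux/entry-common.h> falls back to an empty stub otherwise.
Roughly, as a sketch of the generic side (not quoted from the series):

	/* <linux/entry-common.h>, simplified */
	#ifndef arch_check_user_regs
	static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
	#endif

The same scheme covers the other arch_* hooks, so an architecture only
implements the hooks it actually needs.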

* [tip: x86/entry] x86/entry: Use generic syscall exit functionality
  2020-07-22 22:00 ` [patch V5 11/15] x86/entry: Use generic syscall exit functionality Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     167fd210ec0555d371a20435dac7c2c7052df7ed
Gitweb:        https://git.kernel.org/tip/167fd210ec0555d371a20435dac7c2c7052df7ed
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:05 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:59 +02:00

x86/entry: Use generic syscall exit functionality

Replace the x86 variant with the generic version. Provide the relevant
architecture-specific helper functions and defines.

Use a temporary define for idtentry_exit_user which will be cleaned up
separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.494648601@linutronix.de


---
 arch/x86/entry/common.c             | 221 +---------------------------
 arch/x86/entry/entry_32.S           |   2 +-
 arch/x86/entry/entry_64.S           |   2 +-
 arch/x86/include/asm/entry-common.h |  44 +++++-
 arch/x86/include/asm/idtentry.h     |   3 +-
 arch/x86/include/asm/signal.h       |   1 +-
 arch/x86/kernel/signal.c            |   3 +-
 7 files changed, 54 insertions(+), 222 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d2fe85f..bc96eb8 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -15,15 +15,8 @@
 #include <linux/smp.h>
 #include <linux/errno.h>
 #include <linux/ptrace.h>
-#include <linux/tracehook.h>
-#include <linux/audit.h>
-#include <linux/signal.h>
 #include <linux/export.h>
-#include <linux/context_tracking.h>
-#include <linux/user-return-notifier.h>
 #include <linux/nospec.h>
-#include <linux/uprobes.h>
-#include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
@@ -42,191 +35,6 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
-#include <trace/events/syscalls.h>
-
-/**
- * exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall exit enables interrupts, but the kernel state is interrupts
- * disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Clear CPU buffers if CPU is affected by MDS and the mitigation is on.
- * 4) Tell lockdep that interrupts are enabled
- */
-static __always_inline void exit_to_user_mode(void)
-{
-	instrumentation_begin();
-	trace_hardirqs_on_prepare();
-	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	instrumentation_end();
-
-	user_enter_irqoff();
-	mds_user_clear_cpu_buffers();
-	lockdep_hardirqs_on(CALLER_ADDR0);
-}
-
-#define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
-
-static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
-{
-	/*
-	 * In order to return to user mode, we need to have IRQs off with
-	 * none of EXIT_TO_USERMODE_LOOP_FLAGS set.  Several of these flags
-	 * can be set at any time on preemptible kernels if we have IRQs on,
-	 * so we need to loop.  Disabling preemption wouldn't help: doing the
-	 * work to clear some of the flags can sleep.
-	 */
-	while (true) {
-		/* We have work to do. */
-		local_irq_enable();
-
-		if (cached_flags & _TIF_NEED_RESCHED)
-			schedule();
-
-		if (cached_flags & _TIF_UPROBE)
-			uprobe_notify_resume(regs);
-
-		if (cached_flags & _TIF_PATCH_PENDING)
-			klp_update_patch_state(current);
-
-		/* deal with pending signal delivery */
-		if (cached_flags & _TIF_SIGPENDING)
-			do_signal(regs);
-
-		if (cached_flags & _TIF_NOTIFY_RESUME) {
-			clear_thread_flag(TIF_NOTIFY_RESUME);
-			tracehook_notify_resume(regs);
-			rseq_handle_notify_resume(NULL, regs);
-		}
-
-		/* Disable IRQs and retry */
-		local_irq_disable();
-
-		cached_flags = READ_ONCE(current_thread_info()->flags);
-
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
-			break;
-	}
-}
-
-static void __prepare_exit_to_usermode(struct pt_regs *regs)
-{
-	struct thread_info *ti = current_thread_info();
-	u32 cached_flags;
-
-	addr_limit_user_check();
-
-	lockdep_assert_irqs_disabled();
-	lockdep_sys_exit();
-
-	cached_flags = READ_ONCE(ti->flags);
-
-	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
-		exit_to_usermode_loop(regs, cached_flags);
-
-	/* Reload ti->flags; we may have rescheduled above. */
-	cached_flags = READ_ONCE(ti->flags);
-
-	if (cached_flags & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
-
-	if (unlikely(cached_flags & _TIF_IO_BITMAP))
-		tss_update_io_bitmap();
-
-	fpregs_assert_state_consistent();
-	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
-		switch_fpu_return();
-
-#ifdef CONFIG_COMPAT
-	/*
-	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
-	 * returning to user mode.  We need to clear it *after* signal
-	 * handling, because syscall restart has a fixup for compat
-	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
-	 * selftest.
-	 *
-	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
-	 * special case only applies after poking regs and before the
-	 * very next return to user mode.
-	 */
-	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
-#endif
-}
-
-static noinstr void prepare_exit_to_usermode(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__prepare_exit_to_usermode(regs);
-	instrumentation_end();
-	exit_to_user_mode();
-}
-
-#define SYSCALL_EXIT_WORK_FLAGS				\
-	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
-
-static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
-{
-	bool step;
-
-	audit_syscall_exit(regs);
-
-	if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
-		trace_sys_exit(regs, regs->ax);
-
-	/*
-	 * If TIF_SYSCALL_EMU is set, we only get here because of
-	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
-	 * We already reported this syscall instruction in
-	 * syscall_trace_enter().
-	 */
-	step = unlikely(
-		(cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
-		== _TIF_SINGLESTEP);
-	if (step || cached_flags & _TIF_SYSCALL_TRACE)
-		tracehook_report_syscall_exit(regs, step);
-}
-
-static void __syscall_return_slowpath(struct pt_regs *regs)
-{
-	struct thread_info *ti = current_thread_info();
-	u32 cached_flags = READ_ONCE(ti->flags);
-
-	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
-
-	if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
-	    WARN(irqs_disabled(), "syscall %ld left IRQs disabled", regs->orig_ax))
-		local_irq_enable();
-
-	rseq_syscall(regs);
-
-	/*
-	 * First do one-time work.  If these work items are enabled, we
-	 * want to run them exactly once per syscall exit with IRQs on.
-	 */
-	if (unlikely(cached_flags & SYSCALL_EXIT_WORK_FLAGS))
-		syscall_slow_exit_work(regs, cached_flags);
-
-	local_irq_disable();
-	__prepare_exit_to_usermode(regs);
-}
-
-/*
- * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
- * state such that we can immediately switch to user mode.
- */
-__visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	__syscall_return_slowpath(regs);
-	instrumentation_end();
-	exit_to_user_mode();
-}
-
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -245,7 +53,7 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 #endif
 	}
 	instrumentation_end();
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 }
 #endif
 
@@ -284,7 +92,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 	unsigned int nr = syscall_32_enter(regs);
 
 	do_syscall_32_irqs_on(regs, nr);
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
@@ -310,13 +118,13 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-		syscall_return_slowpath(regs);
+		syscall_exit_to_user_mode(regs);
 		return false;
 	}
 
 	/* Now this is just like a normal syscall. */
 	do_syscall_32_irqs_on(regs, nr);
-	syscall_return_slowpath(regs);
+	syscall_exit_to_user_mode(regs);
 	return true;
 }
 
@@ -524,7 +332,7 @@ void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
 
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
-		prepare_exit_to_usermode(regs);
+		irqentry_exit_to_user_mode(regs);
 	} else if (regs->flags & X86_EFLAGS_IF) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
@@ -555,25 +363,6 @@ void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
 	}
 }
 
-/**
- * idtentry_exit_user - Handle return from exception to user mode
- * @regs:	Pointer to pt_regs (exception entry regs)
- *
- * Runs the necessary preemption and work checks and returns to the caller
- * with interrupts disabled and no further work pending.
- *
- * This is the last action before returning to the low level ASM code which
- * just needs to return to the appropriate context.
- *
- * Counterpart to idtentry_enter_user().
- */
-void noinstr idtentry_exit_user(struct pt_regs *regs)
-{
-	lockdep_assert_irqs_disabled();
-
-	prepare_exit_to_usermode(regs);
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2d0bd5d..6addbd1 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -846,7 +846,7 @@ SYM_CODE_START(ret_from_fork)
 2:
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
-	call    syscall_return_slowpath
+	call    syscall_exit_to_user_mode
 	jmp     .Lsyscall_32_done
 
 	/* kernel thread */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d2a00c9..f423ca9 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -283,7 +283,7 @@ SYM_CODE_START(ret_from_fork)
 2:
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 
 1:
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7070b90..a8f9315 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -2,6 +2,12 @@
 #ifndef _ASM_X86_ENTRY_COMMON_H
 #define _ASM_X86_ENTRY_COMMON_H
 
+#include <linux/user-return-notifier.h>
+
+#include <asm/nospec-branch.h>
+#include <asm/io_bitmap.h>
+#include <asm/fpu/api.h>
+
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
 {
@@ -29,4 +35,42 @@ static __always_inline void arch_check_user_regs(struct pt_regs *regs)
 }
 #define arch_check_user_regs arch_check_user_regs
 
+#define ARCH_SYSCALL_EXIT_WORK		(_TIF_SINGLESTEP)
+
+static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
+						  unsigned long ti_work)
+{
+	if (ti_work & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
+	if (unlikely(ti_work & _TIF_IO_BITMAP))
+		tss_update_io_bitmap();
+
+	fpregs_assert_state_consistent();
+	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+		switch_fpu_return();
+
+#ifdef CONFIG_COMPAT
+	/*
+	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
+	 * returning to user mode.  We need to clear it *after* signal
+	 * handling, because syscall restart has a fixup for compat
+	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
+	 * selftest.
+	 *
+	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
+	 * special case only applies after poking regs and before the
+	 * very next return to user mode.
+	 */
+	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
+#endif
+}
+#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
+
+static __always_inline void arch_exit_to_user_mode(void)
+{
+	mds_user_clear_cpu_buffers();
+}
+#define arch_exit_to_user_mode arch_exit_to_user_mode
+
 #endif
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 449910f..f7d48ea 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -13,8 +13,7 @@
 
 /* Temporary define */
 #define idtentry_enter_user	irqentry_enter_from_user_mode
-
-void idtentry_exit_user(struct pt_regs *regs);
+#define idtentry_exit_user	irqentry_exit_to_user_mode
 
 typedef struct idtentry_state {
 	bool exit_rcu;
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 33d3c88..6fd8410 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -35,7 +35,6 @@ typedef sigset_t compat_sigset_t;
 #endif /* __ASSEMBLY__ */
 #include <uapi/asm/signal.h>
 #ifndef __ASSEMBLY__
-extern void do_signal(struct pt_regs *regs);
 
 #define __ARCH_HAS_SA_RESTORER
 
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 399f97a..d5fa494 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
 #include <linux/context_tracking.h>
+#include <linux/entry-common.h>
 #include <linux/syscalls.h>
 
 #include <asm/processor.h>
@@ -803,7 +804,7 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
  * mistake.
  */
-void do_signal(struct pt_regs *regs)
+void arch_do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread
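
The removed exit_to_usermode_loop() does not disappear; its generic
counterpart in kernel/entry/common.c has the same structure, with the
architecture plugging in via arch_do_signal() and
arch_exit_to_user_mode_prepare(). A simplified sketch of that loop, modeled
on the x86 code removed above (names abbreviated, not a verbatim quote):

	static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
						    unsigned long ti_work)
	{
		/* Work flags can be raised again while handled, so loop. */
		while (ti_work & EXIT_TO_USER_MODE_WORK) {
			local_irq_enable();

			if (ti_work & _TIF_NEED_RESCHED)
				schedule();

			if (ti_work & _TIF_SIGPENDING)
				arch_do_signal(regs);	/* was do_signal() on x86 */

			local_irq_disable();
			ti_work = READ_ONCE(current_thread_info()->flags);
		}
		return ti_work;
	}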

* [tip: x86/entry] x86/ptrace: Provide pt_regs helper for entry/exit
  2020-07-22 22:00 ` [patch V5 09/15] x86/ptrace: Provide pt_regs helper for entry/exit Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     0bf019ea59e330770883ede4499d7f711d8c3adf
Gitweb:        https://git.kernel.org/tip/0bf019ea59e330770883ede4499d7f711d8c3adf
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:03 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:58 +02:00

x86/ptrace: Provide pt_regs helper for entry/exit

As a preparatory step for moving the syscall and interrupt entry/exit
handling into generic code, provide a pt_regs helper which retrieves the
interrupt state from pt_regs. This is required to check whether interrupts
are reenabled by return from interrupt/exception.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.258511584@linutronix.de


---
 arch/x86/include/asm/ptrace.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 255b2dd..40aa69d 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -209,6 +209,11 @@ static inline void user_stack_pointer_set(struct pt_regs *regs,
 	regs->sp = val;
 }
 
+static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
+{
+	return !(regs->flags & X86_EFLAGS_IF);
+}
+
 /* Query offset/name of register from its name/offset */
 extern int regs_query_register_offset(const char *name);
 extern const char *regs_query_register_name(unsigned int offset);

^ permalink raw reply related	[flat|nested] 45+ messages in thread
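
The helper lets the generic exit path make the same decision the x86-specific
idtentry_exit() made with the open-coded 'regs->flags & X86_EFLAGS_IF' test.
A sketch of the consumer side, simplified from the shape of the generic
irqentry_exit():

	void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
	{
		if (user_mode(regs)) {
			irqentry_exit_to_user_mode(regs);
		} else if (!regs_irqs_disabled(regs)) {
			/* Return re-enables interrupts: run the preemption
			 * and RCU/lockdep ordering checks here. */
		} else {
			/* IRQ flags state is already correct. */
		}
	}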

* [tip: x86/entry] x86/entry: Move user return notifier out of loop
  2020-07-22 22:00 ` [patch V5 08/15] x86/entry: Move user return notifier out of loop Thomas Gleixner
  2020-07-23 23:41   ` Sean Christopherson
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     a377ac1cd9d7b9ac8d546dceb3d74956fbfd443f
Gitweb:        https://git.kernel.org/tip/a377ac1cd9d7b9ac8d546dceb3d74956fbfd443f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:02 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:58 +02:00

x86/entry: Move user return notifier out of loop

Guests and user space share certain MSRs. KVM sets these MSRs to guest
values once and does not set them back to user space values on every VM
exit to spare the costly MSR operations.

User return notifiers ensure that these MSRs are set back to the correct
values before returning to user space in exit_to_usermode_loop().

There is no reason to evaluate the TIF flag indicating that user return
notifiers need to be invoked in the loop. The important point is that they
are invoked before returning to user space.

Move the invocation out of the loop into the section which does the last
preparatory steps before returning to user space. That section is not
preemptible and runs with interrupts disabled until the actual return.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220520.159112003@linutronix.de


---
 arch/x86/entry/common.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 68d5c86..9415ae5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -208,7 +208,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -242,9 +242,6 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 			rseq_handle_notify_resume(NULL, regs);
 		}
 
-		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
-			fire_user_return_notifiers();
-
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
@@ -273,6 +270,9 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs)
 	/* Reload ti->flags; we may have rescheduled above. */
 	cached_flags = READ_ONCE(ti->flags);
 
+	if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
 	if (unlikely(cached_flags & _TIF_IO_BITMAP))
 		tss_update_io_bitmap();
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread
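
For reference, the notifier API itself is tiny: a per-CPU callback plus a
registration call. A sketch of how a KVM-like user registers one after
clobbering a shared MSR (the example_* names are made up):

	#include <linux/user-return-notifier.h>

	static void example_on_user_return(struct user_return_notifier *urn)
	{
		/* restore the user space MSR values here */
	}

	static struct user_return_notifier example_urn = {
		.on_user_return = example_on_user_return,
	};

	/* after writing guest values into shared MSRs: */
	user_return_notifier_register(&example_urn);

fire_user_return_notifiers() then invokes the callback exactly once on the
way out, which is why evaluating the TIF flag inside the loop was pointless.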

* [tip: x86/entry] x86/entry: Consolidate 32/64 bit syscall entry
  2020-07-22 22:00 ` [patch V5 07/15] x86/entry: Consolidate 32/64 bit syscall entry Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  2020-07-26 18:33     ` Brian Gerst
  0 siblings, 1 reply; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     0b085e68f4072024ecaa3889aeeaab5f6c8eba5c
Gitweb:        https://git.kernel.org/tip/0b085e68f4072024ecaa3889aeeaab5f6c8eba5c
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:01 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:58 +02:00

x86/entry: Consolidate 32/64 bit syscall entry

The 64-bit and 32-bit entry code have the same open-coded syscall entry
handling after the bit-width-specific bits.

Move it to a helper function and share the code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220520.051234096@linutronix.de


---
 arch/x86/entry/common.c | 93 +++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 52 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ab6cb86..68d5c86 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -366,8 +366,7 @@ __visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
 	exit_to_user_mode();
 }
 
-#ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+static noinstr long syscall_enter(struct pt_regs *regs, unsigned long nr)
 {
 	struct thread_info *ti;
 
@@ -379,6 +378,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
 		nr = syscall_trace_enter(regs);
 
+	instrumentation_end();
+	return nr;
+}
+
+#ifdef CONFIG_X86_64
+__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+{
+	nr = syscall_enter(regs, nr);
+
+	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
 		regs->ax = sys_call_table[nr](regs);
@@ -390,64 +399,53 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = x32_sys_call_table[nr](regs);
 #endif
 	}
-	__syscall_return_slowpath(regs);
-
 	instrumentation_end();
-	exit_to_user_mode();
+	syscall_return_slowpath(regs);
 }
 #endif
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_IA32_EMULATION))
+		current_thread_info()->status |= TS_COMPAT;
+	/*
+	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
+	 * orig_ax, the unsigned int return value truncates it.  This may
+	 * or may not be necessary, but it matches the old asm behavior.
+	 */
+	return syscall_enter(regs, (unsigned int)regs->orig_ax);
+}
+
 /*
- * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
- * all entry and exit work and returns with IRQs off.  This function is
- * extremely hot in workloads that use it, and it's usually called from
- * do_fast_syscall_32, so forcibly inline it to improve performance.
+ * Invoke a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.
  */
-static void do_syscall_32_irqs_on(struct pt_regs *regs)
+static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
+						  unsigned int nr)
 {
-	struct thread_info *ti = current_thread_info();
-	unsigned int nr = (unsigned int)regs->orig_ax;
-
-#ifdef CONFIG_IA32_EMULATION
-	ti->status |= TS_COMPAT;
-#endif
-
-	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
-		/*
-		 * Subtlety here: if ptrace pokes something larger than
-		 * 2^32-1 into orig_ax, this truncates it.  This may or
-		 * may not be necessary, but it matches the old asm
-		 * behavior.
-		 */
-		nr = syscall_trace_enter(regs);
-	}
-
 	if (likely(nr < IA32_NR_syscalls)) {
+		instrumentation_begin();
 		nr = array_index_nospec(nr, IA32_NR_syscalls);
 		regs->ax = ia32_sys_call_table[nr](regs);
+		instrumentation_end();
 	}
-
-	__syscall_return_slowpath(regs);
 }
 
 /* Handles int $0x80 */
 __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
-	instrumentation_begin();
+	unsigned int nr = syscall_32_enter(regs);
 
-	local_irq_enable();
-	do_syscall_32_irqs_on(regs);
-
-	instrumentation_end();
-	exit_to_user_mode();
+	do_syscall_32_irqs_on(regs, nr);
+	syscall_return_slowpath(regs);
 }
 
-static bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
+	unsigned int nr	= syscall_32_enter(regs);
 	int res;
 
+	instrumentation_begin();
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*
@@ -460,17 +458,18 @@ static bool __do_fast_syscall_32(struct pt_regs *regs)
 		res = get_user(*(u32 *)&regs->bp,
 		       (u32 __user __force *)(unsigned long)(u32)regs->sp);
 	}
+	instrumentation_end();
 
 	if (res) {
 		/* User code screwed up. */
 		regs->ax = -EFAULT;
-		local_irq_disable();
-		__prepare_exit_to_usermode(regs);
+		syscall_return_slowpath(regs);
 		return false;
 	}
 
 	/* Now this is just like a normal syscall. */
-	do_syscall_32_irqs_on(regs);
+	do_syscall_32_irqs_on(regs, nr);
+	syscall_return_slowpath(regs);
 	return true;
 }
 
@@ -483,7 +482,6 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	 */
 	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
 					vdso_image_32.sym_int80_landing_pad;
-	bool success;
 
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
@@ -492,17 +490,8 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	 */
 	regs->ip = landing_pad;
 
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-
-	local_irq_enable();
-	success = __do_fast_syscall_32(regs);
-
-	instrumentation_end();
-	exit_to_user_mode();
-
-	/* If it failed, keep it simple: use IRET. */
-	if (!success)
+	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
+	if (!__do_fast_syscall_32(regs))
 		return 0;
 
 #ifdef CONFIG_X86_64

^ permalink raw reply related	[flat|nested] 45+ messages in thread
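
The instrumentation_begin/end() moves in this patch follow the general
noinstr rule: a noinstr function runs with tracing, kprobes and friends
forbidden, and must explicitly bracket any region that calls into ordinary,
instrumentable kernel code. The skeleton, with example_helper as a
placeholder name:

	static noinstr long example_helper(struct pt_regs *regs)
	{
		/* noinstr region: no tracepoints, no kprobes */
		instrumentation_begin();
		/* calls into regular kernel code are legal here */
		instrumentation_end();
		/* noinstr again until return */
		return 0;
	}

This is the pattern visible in syscall_enter() and do_syscall_64() above.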

* [tip: x86/entry] x86/entry: Consolidate check_user_regs()
  2020-07-22 22:00 ` [patch V5 06/15] x86/entry: Consolidate check_user_regs() Thomas Gleixner
@ 2020-07-24 20:11   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-24 20:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Kees Cook, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     8d5ea35c5e9139dbd19a3d73985d008d36c9968f
Gitweb:        https://git.kernel.org/tip/8d5ea35c5e9139dbd19a3d73985d008d36c9968f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 23 Jul 2020 00:00:00 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 24 Jul 2020 15:04:57 +02:00

x86/entry: Consolidate check_user_regs()

The user register sanity check is sprinkled all over the place. Move it
into enter_from_user_mode().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.943016204@linutronix.de


---
 arch/x86/entry/common.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 4eae4c1..ab6cb86 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -82,10 +82,11 @@ static noinstr void check_user_regs(struct pt_regs *regs)
  * 2) Invoke context tracking if enabled to reactivate RCU
  * 3) Trace interrupts off state
  */
-static noinstr void enter_from_user_mode(void)
+static noinstr void enter_from_user_mode(struct pt_regs *regs)
 {
 	enum ctx_state state = ct_state();
 
+	check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	user_exit_irqoff();
 
@@ -95,8 +96,9 @@ static noinstr void enter_from_user_mode(void)
 	instrumentation_end();
 }
 #else
-static __always_inline void enter_from_user_mode(void)
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 {
+	check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
@@ -369,9 +371,7 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
 	struct thread_info *ti;
 
-	check_user_regs(regs);
-
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -434,9 +434,7 @@ static void do_syscall_32_irqs_on(struct pt_regs *regs)
 /* Handles int $0x80 */
 __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
-	check_user_regs(regs);
-
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -487,8 +485,6 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 					vdso_image_32.sym_int80_landing_pad;
 	bool success;
 
-	check_user_regs(regs);
-
 	/*
 	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
 	 * so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
@@ -496,7 +492,7 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	 */
 	regs->ip = landing_pad;
 
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 	instrumentation_begin();
 
 	local_irq_enable();
@@ -599,8 +595,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
 	};
 
 	if (user_mode(regs)) {
-		check_user_regs(regs);
-		enter_from_user_mode();
+		enter_from_user_mode(regs);
 		return ret;
 	}
 
@@ -733,8 +728,7 @@ void noinstr idtentry_exit(struct pt_regs *regs, idtentry_state_t state)
  */
 void noinstr idtentry_enter_user(struct pt_regs *regs)
 {
-	check_user_regs(regs);
-	enter_from_user_mode();
+	enter_from_user_mode(regs);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest
  2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
@ 2020-07-24 20:51 ` Thomas Gleixner
  2020-07-29 13:39   ` Steven Price
  15 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-24 20:51 UTC (permalink / raw)
  To: LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

Thomas Gleixner <tglx@linutronix.de> writes:
> This is the 5th version of generic entry/exit functionality for host and
> guest.

I've merged the pile in two steps. Patches 1-5, i.e. the generic code, are
here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/entry

and merged this branch together with patches 6-15 into

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry

core/entry is immutable; any updates or changes go on top. It's meant
as a base for other architecture developers who want to fiddle with this
without having to pull in the x86 mess as well.

Thanks for all the help!

       tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [tip: x86/entry] x86/entry: Consolidate 32/64 bit syscall entry
  2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
@ 2020-07-26 18:33     ` Brian Gerst
  2020-07-27 13:38       ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Brian Gerst @ 2020-07-26 18:33 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: linux-tip-commits, Thomas Gleixner, x86

On Fri, Jul 24, 2020 at 4:14 PM tip-bot2 for Thomas Gleixner
<tip-bot2@linutronix.de> wrote:
>
> The following commit has been merged into the x86/entry branch of tip:
>
> Commit-ID:     0b085e68f4072024ecaa3889aeeaab5f6c8eba5c
> Gitweb:        https://git.kernel.org/tip/0b085e68f4072024ecaa3889aeeaab5f6c8eba5c
> Author:        Thomas Gleixner <tglx@linutronix.de>
> AuthorDate:    Thu, 23 Jul 2020 00:00:01 +02:00
> Committer:     Thomas Gleixner <tglx@linutronix.de>
> CommitterDate: Fri, 24 Jul 2020 15:04:58 +02:00
>
> x86/entry: Consolidate 32/64 bit syscall entry
>
> 64bit and 32bit entry code have the same open coded syscall entry handling
> after the bitwidth specific bits.
>
> Move it to a helper function and share the code.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lkml.kernel.org/r/20200722220520.051234096@linutronix.de
>
>
> ---
>  arch/x86/entry/common.c | 93 +++++++++++++++++-----------------------
>  1 file changed, 41 insertions(+), 52 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index ab6cb86..68d5c86 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -366,8 +366,7 @@ __visible noinstr void syscall_return_slowpath(struct pt_regs *regs)
>         exit_to_user_mode();
>  }
>
> -#ifdef CONFIG_X86_64
> -__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
> +static noinstr long syscall_enter(struct pt_regs *regs, unsigned long nr)
>  {
>         struct thread_info *ti;
>
> @@ -379,6 +378,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>         if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
>                 nr = syscall_trace_enter(regs);
>
> +       instrumentation_end();
> +       return nr;
> +}
> +
> +#ifdef CONFIG_X86_64
> +__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
> +{
> +       nr = syscall_enter(regs, nr);
> +
> +       instrumentation_begin();
>         if (likely(nr < NR_syscalls)) {
>                 nr = array_index_nospec(nr, NR_syscalls);
>                 regs->ax = sys_call_table[nr](regs);
> @@ -390,64 +399,53 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>                 regs->ax = x32_sys_call_table[nr](regs);
>  #endif
>         }
> -       __syscall_return_slowpath(regs);
> -
>         instrumentation_end();
> -       exit_to_user_mode();
> +       syscall_return_slowpath(regs);
>  }
>  #endif
>
>  #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> +static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
> +{
> +       if (IS_ENABLED(CONFIG_IA32_EMULATION))
> +               current_thread_info()->status |= TS_COMPAT;
> +       /*
> +        * Subtlety here: if ptrace pokes something larger than 2^32-1 into
> +        * orig_ax, the unsigned int return value truncates it.  This may
> +        * or may not be necessary, but it matches the old asm behavior.
> +        */
> +       return syscall_enter(regs, (unsigned int)regs->orig_ax);
> +}
> +
>  /*
> - * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
> - * all entry and exit work and returns with IRQs off.  This function is
> - * extremely hot in workloads that use it, and it's usually called from
> - * do_fast_syscall_32, so forcibly inline it to improve performance.
> + * Invoke a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.
>   */
> -static void do_syscall_32_irqs_on(struct pt_regs *regs)
> +static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
> +                                                 unsigned int nr)
>  {
> -       struct thread_info *ti = current_thread_info();
> -       unsigned int nr = (unsigned int)regs->orig_ax;
> -
> -#ifdef CONFIG_IA32_EMULATION
> -       ti->status |= TS_COMPAT;
> -#endif
> -
> -       if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
> -               /*
> -                * Subtlety here: if ptrace pokes something larger than
> -                * 2^32-1 into orig_ax, this truncates it.  This may or
> -                * may not be necessary, but it matches the old asm
> -                * behavior.
> -                */
> -               nr = syscall_trace_enter(regs);
> -       }
> -
>         if (likely(nr < IA32_NR_syscalls)) {
> +               instrumentation_begin();
>                 nr = array_index_nospec(nr, IA32_NR_syscalls);
>                 regs->ax = ia32_sys_call_table[nr](regs);
> +               instrumentation_end();
>         }
> -
> -       __syscall_return_slowpath(regs);
>  }
>
>  /* Handles int $0x80 */
>  __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
>  {
> -       enter_from_user_mode(regs);
> -       instrumentation_begin();
> +       unsigned int nr = syscall_32_enter(regs);
>
> -       local_irq_enable();
> -       do_syscall_32_irqs_on(regs);
> -
> -       instrumentation_end();
> -       exit_to_user_mode();
> +       do_syscall_32_irqs_on(regs, nr);
> +       syscall_return_slowpath(regs);
>  }
>
> -static bool __do_fast_syscall_32(struct pt_regs *regs)
> +static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)

Can __do_fast_syscall_32() be merged back into do_fast_syscall_32()
now that both are marked noinstr?

--
Brian Gerst

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [tip: x86/entry] x86/entry: Consolidate 32/64 bit syscall entry
  2020-07-26 18:33     ` Brian Gerst
@ 2020-07-27 13:38       ` Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-27 13:38 UTC (permalink / raw)
  To: Brian Gerst, Linux Kernel Mailing List; +Cc: linux-tip-commits, x86

Brian Gerst <brgerst@gmail.com> writes:
> On Fri, Jul 24, 2020 at 4:14 PM tip-bot2 for Thomas Gleixner
>>
>> -static bool __do_fast_syscall_32(struct pt_regs *regs)
>> +static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>
> Can __do_fast_syscall_32() be merged back into do_fast_syscall_32()
> now that both are marked noinstr?

It could.
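
For illustration only, a rough sketch of what the folded-together function
in arch/x86/entry/common.c might look like, pieced together from the hunks
quoted above. This is hypothetical, not a patch from this thread; it
simplifies the CONFIG_X86_64 EBP fetch to the plain get_user() variant and
elides the trailing SYSEXIT/SYSRET eligibility checks:

__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
{
	/*
	 * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
	 * so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
	 */
	unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
					vdso_image_32.sym_int80_landing_pad;
	unsigned int nr;
	int res;

	regs->ip = landing_pad;

	nr = syscall_32_enter(regs);

	instrumentation_begin();
	/* Fetch EBP from where the vDSO stashed it. */
	res = get_user(*(u32 *)&regs->bp,
	       (u32 __user __force *)(unsigned long)(u32)regs->sp);
	instrumentation_end();

	if (res) {
		/* User code screwed up. Keep it simple: use IRET. */
		regs->ax = -EFAULT;
		syscall_return_slowpath(regs);
		return 0;
	}

	/* Now this is just like a normal syscall. */
	do_syscall_32_irqs_on(regs, nr);
	syscall_return_slowpath(regs);

	/* ... SYSEXIT/SYSRET eligibility checks would follow here ... */
	return 1;
}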

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest
  2020-07-24 20:51 ` [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
@ 2020-07-29 13:39   ` Steven Price
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Price @ 2020-07-29 13:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson

On 24/07/2020 21:51, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> This is the 5th version of generic entry/exit functionality for host and
>> guest.
> 
> I've merged the pile in two steps. Patches 1-5, i.e. the generic code, are
> here:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/entry
> 
> and merged this branch together with patches 6-15 into
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry
> 
> core/entry is immutable; any updates or changes go on top. It's meant
> as a base for other architecture developers who want to fiddle with this
> without having to pull in the x86 mess as well.

Hi Thomas,

Just FYI: I'm going to take a look at arm64 support for this. I know 
Mark previously posted[1] a series moving much of the code to C, so 
hopefully I can build on that and make use of the generic code.
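
For reference, the hook points an architecture fills in are the ones from
the entry-kvm.h patch quoted earlier: select KVM_XFER_TO_GUEST_WORK in the
arch KVM Kconfig, optionally extend the work mask, and optionally override
the default work function in an architecture specific header. A
hypothetical sketch (the flag _TIF_ARCH_EXAMPLE and its handling are made
up):

#define ARCH_XFER_TO_GUEST_MODE_WORK	(_TIF_ARCH_EXAMPLE)

static inline int arch_xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
					       unsigned long ti_work)
{
	if (ti_work & _TIF_ARCH_EXAMPLE) {
		/* Handle the architecture specific work item here. */
	}
	/* Returning 0 lets the transfer to guest mode proceed. */
	return 0;
}
#define arch_xfer_to_guest_mode_work arch_xfer_to_guest_mode_work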

Thanks,

Steve

[1] https://lore.kernel.org/r/20200108185634.1163-1-mark.rutland@arm.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode
  2020-07-22 21:59 ` [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode Thomas Gleixner
  2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
@ 2020-07-29 16:55   ` Qian Cai
  2020-07-30  7:19     ` Thomas Gleixner
  1 sibling, 1 reply; 45+ messages in thread
From: Qian Cai @ 2020-07-29 16:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson, sfr, linux-next

On Wed, Jul 22, 2020 at 11:59:59PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Entering a guest is similar to exiting to user space. Pending work like
> handling signals, rescheduling, task work etc. needs to be handled before
> that.
> 
> Provide generic infrastructure to avoid duplication of the same handling
> code all over the place.
> 
> The transfer to guest mode handling is different from the exit to usermode
> handling, e.g. vs. rseq and live patching, so a separate function is used.
> 
> The initial list of work items handled is:
> 
>     TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME
> 
> Architecture specific TIF flags can be added via defines in the
> architecture specific include files.
> 
> The calling convention is also different from the syscall/interrupt entry
> functions as KVM invokes this from the outer vcpu_run() loop with
> interrupts and preemption enabled. To prevent missing a pending work item
> it invokes a check for pending TIF work from interrupt disabled code right
> before transitioning to guest mode. The lockdep, RCU and tracing state
> handling is also done directly around the switch to and from guest mode.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

SR-IOV starts to trigger the warning below with this commit.

A reproducer:
# echo 1 > /sys/class/net/eno58/device/sriov_numvfs (bnx2x)
# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make
# ./random -x 0-100 -k0000:02:09.0

https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L1247
(x86.config is also included in the repo.)

[  765.409027] ------------[ cut here ]------------
[  765.434611] WARNING: CPU: 13 PID: 3377 at include/linux/entry-kvm.h:75 kvm_arch_vcpu_ioctl_run+0xb52/0x1320 [kvm]
[  765.487229] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop nls_ascii nls_cp437 vfat fat
[  766.118568] CPU: 13 PID: 3377 Comm: qemu-kvm Not tainted 5.8.0-rc7-next-20200729 #2
[  766.147011]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  766.147016]  ret_from_fork+0x22/0x30
[  766.782999] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
[  766.817799] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xb52/0x1320 [kvm]
[  766.850054] Code: e8 03 0f b6 04 18 84 c0 74 06 0f 8e 4a 03 00 00 41 c6 85 50 2d 00 00 00 e9 24 f8 ff ff 4c 89 ef e8 93 a8 02 00 e9 3d f8 ff ff <0f> 0b e9 f2 f8 ff ff 48 81 c6 c0 01 00 00 4c 89 ef e8 18 4c fe ff
[  766.942485] RSP: 0018:ffffc90007017b58 EFLAGS: 00010202
[  766.970899] RAX: 0000000000000001 RBX: dffffc0000000000 RCX: ffffffffc0f0ef8a
[  767.008488] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88956a3821a0
[  767.045726] RBP: ffffc90007017bb8 R08: ffffed1269290010 R09: ffffed1269290010
[  767.083242] R10: ffff88934948007f R11: ffffed126929000f R12: ffff889349480424
[  767.120809] R13: ffff889349480040 R14: ffff88934948006c R15: ffff88980e2da000
[  767.160531] FS:  00007f12b0e98700(0000) GS:ffff88985fa40000(0000) knlGS:0000000000000000
[  767.203199] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  767.235083] CR2: 0000000000000000 CR3: 000000120d186002 CR4: 00000000001726e0
[  767.272303] Call Trace:
[  767.286947]  kvm_vcpu_ioctl+0x3e9/0xb30 [kvm]
[  767.311162]  ? kvm_vcpu_block+0xd30/0xd30 [kvm]
[  767.335281]  ? find_held_lock+0x33/0x1c0
[  767.357492]  ? __fget_files+0x1cb/0x300
[  767.378687]  ? lock_downgrade+0x730/0x730
[  767.401410]  ? rcu_read_lock_held+0xaa/0xc0
[  767.424228]  ? rcu_read_lock_sched_held+0xd0/0xd0
[  767.450078]  ? find_held_lock+0x33/0x1c0
[  767.471876]  ? __fget_files+0x1e5/0x300
[  767.493841]  __x64_sys_ioctl+0x315/0x102f
[  767.516579]  ? generic_block_fiemap+0x60/0x60
[  767.540678]  ? syscall_enter_from_user_mode+0x1b/0x210
[  767.568612]  ? rcu_read_lock_sched_held+0xaa/0xd0
[  767.593950]  ? lockdep_hardirqs_on_prepare+0x33e/0x4e0
[  767.621355]  ? syscall_enter_from_user_mode+0x20/0x210
[  767.649283]  ? trace_hardirqs_on+0x20/0x1b5
[  767.673770]  do_syscall_64+0x33/0x40
[  767.694699]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  767.723392] RIP: 0033:0x7f12b999a87b
[  767.744252] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89  8
[  767.837418] RSP: 002b:00007f12b0e97678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  767.877074] RAX: ffffffffffffffda RBX: 000055712c857c60 RCX: 00007f12b999a87b
[  767.914242] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
[  767.951945] RBP: 000055712c857cfb R08: 000055712ac4fad0 R09: 000055712b3f2080
[  767.989036] R10: 0000000000000000 R11: 0000000000000246 R12: 000055712ac38100
[  768.025851] R13: 00007ffd996e9edf R14: 00007f12becc5000 R15: 000055712c857c60
[  768.063363] irq event stamp: 5257
[  768.082559] hardirqs last  enabled at (5273): [<ffffffffa1800ce2>] asm_sysvec_call_function_single+0x12/0x20
[  768.132704] hardirqs last disabled at (5290): [<ffffffffa179ae3d>] irqentry_enter+0x1d/0x50
[  768.176563] softirqs last  enabled at (5108): [<ffffffffa1a0070f>] __do_softirq+0x70f/0xa9f
[  768.221270] softirqs last disabled at (5093): [<ffffffffa1800ec2>] asm_call_on_stack+0x12/0x20
[  768.267273] ---[ end trace 8730450ad8cfee9f ]---

> ---
> V5: Rename exit -> xfer (Sean)
> 
> V3: Reworked and simplified version adopted to recent X86 and KVM changes
>     
> V2: Moved KVM specific functions to kvm (Paolo)
>     Added lockdep assert (Andy)
>     Dropped live patching from enter guest mode work (Miroslav)
> ---
>  include/linux/entry-kvm.h |   80 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h  |    8 ++++
>  kernel/entry/Makefile     |    3 +
>  kernel/entry/kvm.c        |   51 +++++++++++++++++++++++++++++
>  virt/kvm/Kconfig          |    3 +
>  5 files changed, 144 insertions(+), 1 deletion(-)
> 
> --- /dev/null
> +++ b/include/linux/entry-kvm.h
> @@ -0,0 +1,80 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_ENTRYKVM_H
> +#define __LINUX_ENTRYKVM_H
> +
> +#include <linux/entry-common.h>
> +
> +/* Transfer to guest mode work */
> +#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
> +
> +#ifndef ARCH_XFER_TO_GUEST_MODE_WORK
> +# define ARCH_XFER_TO_GUEST_MODE_WORK	(0)
> +#endif
> +
> +#define XFER_TO_GUEST_MODE_WORK					\
> +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |			\
> +	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> +
> +struct kvm_vcpu;
> +
> +/**
> + * arch_xfer_to_guest_mode_work - Architecture specific xfer to guest mode
> + *				  work function.
> + * @vcpu:	Pointer to current's VCPU data
> + * @ti_work:	Cached TIF flags gathered in xfer_to_guest_mode()
> + *
> + * Invoked from xfer_to_guest_mode_work(). Defaults to NOOP. Can be
> + * replaced by architecture specific code.
> + */
> +static inline int arch_xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> +					      unsigned long ti_work);
> +
> +#ifndef arch_xfer_to_guest_mode_work
> +static inline int arch_xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> +					       unsigned long ti_work)
> +{
> +	return 0;
> +}
> +#endif
> +
> +/**
> + * xfer_to_guest_mode - Check and handle pending work which needs to be
> + *			handled before returning to guest mode
> + * @vcpu:	Pointer to current's VCPU data
> + *
> + * Returns: 0 or an error code
> + */
> +int xfer_to_guest_mode(struct kvm_vcpu *vcpu);
> +
> +/**
> + * __xfer_to_guest_mode_work_pending - Check if work is pending
> + *
> + * Returns: True if work pending, False otherwise.
> + *
> + * Bare variant of xfer_to_guest_mode_work_pending(). Can be called from
> + * interrupt enabled code for racy quick checks with care.
> + */
> +static inline bool __xfer_to_guest_mode_work_pending(void)
> +{
> +	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> +	return !!(ti_work & XFER_TO_GUEST_MODE_WORK);
> +}
> +
> +/**
> + * xfer_to_guest_mode_work_pending - Check if work is pending which needs to be
> + *				     handled before returning to guest mode
> + *
> + * Returns: True if work pending, False otherwise.
> + *
> + * Has to be invoked with interrupts disabled before the transition to
> + * guest mode.
> + */
> +static inline bool xfer_to_guest_mode_work_pending(void)
> +{
> +	lockdep_assert_irqs_disabled();
> +	return __xfer_to_guest_mode_work_pending();
> +}
> +#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
> +
> +#endif
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1439,4 +1439,12 @@ int kvm_vm_create_worker_thread(struct k
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>  
> +#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
> +static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_INTR;
> +	vcpu->stat.signal_exits++;
> +}
> +#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
> +
>  #endif
> --- a/kernel/entry/Makefile
> +++ b/kernel/entry/Makefile
> @@ -9,4 +9,5 @@ KCOV_INSTRUMENT := n
>  CFLAGS_REMOVE_common.o	 = -fstack-protector -fstack-protector-strong
>  CFLAGS_common.o		+= -fno-stack-protector
>  
> -obj-$(CONFIG_GENERIC_ENTRY) += common.o
> +obj-$(CONFIG_GENERIC_ENTRY) 		+= common.o
> +obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK)	+= kvm.o
> --- /dev/null
> +++ b/kernel/entry/kvm.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/entry-kvm.h>
> +#include <linux/kvm_host.h>
> +
> +static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> +{
> +	do {
> +		int ret;
> +
> +		if (ti_work & _TIF_SIGPENDING) {
> +			kvm_handle_signal_exit(vcpu);
> +			return -EINTR;
> +		}
> +
> +		if (ti_work & _TIF_NEED_RESCHED)
> +			schedule();
> +
> +		if (ti_work & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(NULL);
> +		}
> +
> +		ret = arch_xfer_to_guest_mode_work(vcpu, ti_work);
> +		if (ret)
> +			return ret;
> +
> +		ti_work = READ_ONCE(current_thread_info()->flags);
> +	} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
> +	return 0;
> +}
> +
> +int xfer_to_guest_mode(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long ti_work;
> +
> +	/*
> +	 * This is invoked from the outer guest loop with interrupts and
> +	 * preemption enabled.
> +	 *
> +	 * KVM invokes xfer_to_guest_mode_work_pending() with interrupts
> +	 * disabled in the inner loop before going into guest mode. No need
> +	 * to disable interrupts here.
> +	 */
> +	ti_work = READ_ONCE(current_thread_info()->flags);
> +	if (!(ti_work & XFER_TO_GUEST_MODE_WORK))
> +		return 0;
> +
> +	return xfer_to_guest_mode_work(vcpu, ti_work);
> +}
> +EXPORT_SYMBOL_GPL(xfer_to_guest_mode);
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -60,3 +60,6 @@ config HAVE_KVM_VCPU_RUN_PID_CHANGE
>  
>  config HAVE_KVM_NO_POLL
>         bool
> +
> +config KVM_XFER_TO_GUEST_WORK
> +       bool
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode
  2020-07-29 16:55   ` [patch V5 05/15] " Qian Cai
@ 2020-07-30  7:19     ` Thomas Gleixner
  2020-07-30 10:34       ` [tip: x86/entry] x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() tip-bot2 for Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2020-07-30  7:19 UTC (permalink / raw)
  To: Qian Cai
  Cc: LKML, x86, linux-arch, Will Deacon, Arnd Bergmann, Mark Rutland,
	Kees Cook, Keno Fischer, Paolo Bonzini, kvm,
	Gabriel Krisman Bertazi, Sean Christopherson, sfr, linux-next

Qian Cai <cai@lca.pw> writes:
> On Wed, Jul 22, 2020 at 11:59:59PM +0200, Thomas Gleixner wrote:
> SR-IOV starts to trigger the warning below with this commit.
>
> [  765.434611] WARNING: CPU: 13 PID: 3377 at include/linux/entry-kvm.h:75 kvm_arch_vcpu_ioctl_run+0xb52/0x1320 [kvm]

Yes, I'm a moron. Fixed it locally and failed to transfer the fixup when
merging it. Fix below.

> [  768.221270] softirqs last disabled at (5093): [<ffffffffa1800ec2>] asm_call_on_stack+0x12/0x20
> [  768.267273] ---[ end trace 8730450ad8cfee9f ]---

Can you pretty please trim your replies?

>> ---
>> V5: Rename exit -> xfer (Sean)

<removed 200 lines of useless information>

Thanks,

        tglx
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82d4a9e88908..532597265c50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8682,7 +8682,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		if (xfer_to_guest_mode_work_pending()) {
+		if (__xfer_to_guest_mode_work_pending()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			r = xfer_to_guest_mode_handle_work(vcpu);
 			if (r)


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip: x86/entry] x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()
  2020-07-30  7:19     ` Thomas Gleixner
@ 2020-07-30 10:34       ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-07-30 10:34 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Qian Cai, Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     f3020b8891b890b48d9e1a83241e3cce518427c1
Gitweb:        https://git.kernel.org/tip/f3020b8891b890b48d9e1a83241e3cce518427c1
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 30 Jul 2020 09:19:01 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 30 Jul 2020 12:31:47 +02:00

x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()

The comments explicitly explain that the work flags check and handling in
kvm_run_vcpu() is done with preemption and interrupts enabled, as KVM
invokes the check again right before entering guest mode with interrupts
disabled, which guarantees that the work flags are observed and handled
before VMENTER.

Nevertheless, the pending work check in kvm_run_vcpu() uses the helper
variant which requires interrupts to be disabled, triggering an instant
lockdep splat. This was caught in testing earlier and then not fixed up
before the patch was applied. :(

Use the relaxed and intentionally racy __xfer_to_guest_mode_work_pending()
instead.

Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87bljxa2sa.fsf@nanos.tec.linutronix.de


---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82d4a9e..5325972 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8682,7 +8682,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		if (xfer_to_guest_mode_work_pending()) {
+		if (__xfer_to_guest_mode_work_pending()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			r = xfer_to_guest_mode_handle_work(vcpu);
 			if (r)
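
To make the distinction concrete, here is a sketch of the intended
two-level check, inferred from the commit message and the entry-kvm.h
helpers quoted earlier in the thread. It is an illustration only;
vcpu_run_sketch() is a made-up stand-in for the real KVM loop, and the
SRCU handling visible in the diff above is reduced to a comment:

#include <linux/entry-kvm.h>
#include <linux/kvm_host.h>

static int vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
	int r;

	for (;;) {
		/*
		 * Outer loop: interrupts and preemption are enabled, so
		 * only the relaxed, intentionally racy helper is valid.
		 */
		if (__xfer_to_guest_mode_work_pending()) {
			/* (the real loop drops SRCU around this) */
			r = xfer_to_guest_mode_handle_work(vcpu);
			if (r)
				return r;
		}

		local_irq_disable();
		/*
		 * Final check right before VMENTER: the strict helper
		 * asserts via lockdep that interrupts are disabled, which
		 * guarantees a pending work item cannot be missed.
		 */
		if (xfer_to_guest_mode_work_pending()) {
			local_irq_enable();
			continue;
		}

		/* ... enter guest mode, exit, then iterate ... */
		local_irq_enable();
	}
}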

^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-07-30 10:34 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-22 21:59 [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
2020-07-22 21:59 ` [patch V5 01/15] seccomp: Provide stub for __secure_computing() Thomas Gleixner
2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 21:59 ` [patch V5 02/15] entry: Provide generic syscall entry functionality Thomas Gleixner
2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 21:59 ` [patch V5 03/15] entry: Provide generic syscall exit function Thomas Gleixner
2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 21:59 ` [patch V5 04/15] entry: Provide generic interrupt entry/exit code Thomas Gleixner
2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 21:59 ` [patch V5 05/15] entry: Provide infrastructure for work before transitioning to guest mode Thomas Gleixner
2020-07-24 19:08   ` [tip: core/entry] " tip-bot2 for Thomas Gleixner
2020-07-29 16:55   ` [patch V5 05/15] " Qian Cai
2020-07-30  7:19     ` Thomas Gleixner
2020-07-30 10:34       ` [tip: x86/entry] x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 06/15] x86/entry: Consolidate check_user_regs() Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 07/15] x86/entry: Consolidate 32/64 bit syscall entry Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-26 18:33     ` Brian Gerst
2020-07-27 13:38       ` Thomas Gleixner
2020-07-22 22:00 ` [patch V5 08/15] x86/entry: Move user return notifier out of loop Thomas Gleixner
2020-07-23 23:41   ` Sean Christopherson
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 09/15] x86/ptrace: Provide pt_regs helper for entry/exit Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 10/15] x86/entry: Use generic syscall entry function Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 11/15] x86/entry: Use generic syscall exit functionality Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 12/15] x86/entry: Cleanup idtentry_entry/exit_user Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 13/15] x86/entry: Use generic interrupt entry/exit code Thomas Gleixner
2020-07-24 14:28   ` Ingo Molnar
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 14/15] x86/entry: Cleanup idtentry_enter/exit Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-22 22:00 ` [patch V5 15/15] x86/kvm: Use generic xfer to guest work function Thomas Gleixner
2020-07-24  0:17   ` Sean Christopherson
2020-07-24  0:46     ` Thomas Gleixner
2020-07-24  0:55       ` Sean Christopherson
2020-07-24 14:24   ` Ingo Molnar
2020-07-24 19:08     ` Thomas Gleixner
2020-07-24 20:11   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-07-24 20:51 ` [patch V5 00/15] entry, x86, kvm: Generic entry/exit functionality for host and guest Thomas Gleixner
2020-07-29 13:39   ` Steven Price
