* [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

This is incomplete, but it's finally good enough that I think it's
time to get other opinions on it.  It is a complete rewrite of the
slow path code that handles exits to user mode.

The exit-to-usermode code is copied in several places and is written
in a nasty combination of asm and C.  It's not at all clear what
it's supposed to do, and the way it's structured makes it very hard
to work with.  For example, it's not even clear why syscall exit
hooks are called only once per syscall right now.  (It seems to be a
side effect of the way that rdi and rdx are handled in the asm loop,
and it seems reliable, but it's still pointlessly complicated.)  The
existing code also makes context tracking overly complicated and
hard to understand.  Finally, it's nearly impossible for anyone to
change what happens on exit to usermode, since the existing code is
so fragile.

I tried to clean it up incrementally, but I decided it was too hard.
Instead, this series just replaces the code.  It seems to work.

Context tracking in particular works very differently now.  The
low-level entry code checks that we're in CONTEXT_USER and switches
to CONTEXT_KERNEL.  The exit code does the reverse.  There is no
need to track what CONTEXT_XYZ state we came from, because we
already know.  Similarly, SCHEDULE_USER is gone, since we can
reschedule if needed by simply calling schedule() from C code.
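
In rough outline, the new shape is (a sketch only; the concrete
hooks this series adds are enter_from_user_mode() on the way in and
prepare_exit_to_usermode()/syscall_return_slowpath() on the way
out):

	/* On entry from user mode, IRQs off: */
	context_tracking_assert_state(CONTEXT_USER);
	user_exit();			/* switch to CONTEXT_KERNEL */

	/* ... handle the syscall/exception/interrupt ... */

	/* On exit to user mode, IRQs off, all work flags handled: */
	user_enter();			/* switch back to CONTEXT_USER */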

The main things that are missing are that I haven't done the 32-bit
parts (anyone want to help?) and therefore I haven't deleted the old
C code.  I also think this may break UML for trivial reasons.

Because I haven't converted the 32-bit code yet, all of the
now-unnecessary calls to exception_enter are still present in traps.c.

IRQ context tracking is still duplicated.  We should probably clean
it up by changing the core code to supply something like
irq_enter_we_are_already_in_context_kernel.
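
Very roughly, and purely as a hypothetical sketch (it ignores the
idle-tick bookkeeping that irq_enter() also performs), such a helper
could look like:

	/* Hypothetical core helper; not part of this series: */
	static inline void irq_enter_we_are_already_in_context_kernel(void)
	{
		/*
		 * The low-level entry code has already switched this
		 * CPU to CONTEXT_KERNEL, so RCU is watching and the
		 * rcu_irq_enter() call in irq_enter() would be
		 * redundant.
		 */
		__irq_enter();
	}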

Thoughts?

Andy Lutomirski (13):
  context_tracking: Add context_tracking_assert_state
  notifiers: Assert that RCU is watching in notify_die
  x86: Move C entry and exit code to arch/x86/entry/common.c
  x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
  x86/entry: Add enter_from_user_mode and use it in syscalls
  x86/entry: Add new, comprehensible entry and exit hooks
  x86/entry/64: Really create an error-entry-from-usermode code path
  x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  x86/entry/compat: Migrate compat syscalls to new exit hooks
  x86/asm/entry/64: Save all regs on interrupt entry
  x86/asm/entry/64: Simplify irq stack pt_regs handling
  x86/asm/entry/64: Migrate error and interrupt exit work to C
  x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h

 arch/x86/entry/Makefile                 |   1 +
 arch/x86/entry/common.c                 | 372 ++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S               | 176 ++++-----------
 arch/x86/entry/entry_64_compat.S        |   7 +-
 arch/x86/include/asm/context_tracking.h |  10 -
 arch/x86/include/asm/signal.h           |   1 +
 arch/x86/kernel/ptrace.c                | 202 +----------------
 arch/x86/kernel/signal.c                |  28 +--
 arch/x86/kernel/traps.c                 |   9 +
 include/linux/context_tracking.h        |   8 +
 kernel/notifier.c                       |   2 +
 11 files changed, 439 insertions(+), 377 deletions(-)
 create mode 100644 arch/x86/entry/common.c
 delete mode 100644 arch/x86/include/asm/context_tracking.h

-- 
2.4.3


* [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

This will let us sprinkle sanity checks around the kernel without
making too much of a mess.
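
For example, an exception handler that is supposed to run in
CONTEXT_KERNEL can assert exactly that, as a later patch in this
series does:

	context_tracking_assert_state(CONTEXT_KERNEL);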

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/context_tracking.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..0fbea4b152e1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
 	if (context_tracking_is_enabled())
 		__context_tracking_task_switch(prev, next);
 }
+
+static inline void context_tracking_assert_state(enum ctx_state state)
+{
+	rcu_lockdep_assert(!context_tracking_is_enabled() ||
+			   this_cpu_read(context_tracking.state) == state,
+			   "context tracking state was wrong");
+}
 #else
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
@@ -64,6 +71,7 @@ static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline void context_tracking_task_switch(struct task_struct *prev,
 						struct task_struct *next) { }
+static inline void context_tracking_assert_state(enum ctx_state state) { }
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
 
-- 
2.4.3


* [RFC/INCOMPLETE 02/13] notifiers: Assert that RCU is watching in notify_die
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

Low-level arch entries often call notify_die, and it's easy for arch
code to fail to exit an RCU quiescent state first.  Assert that
we're not quiescent in notify_die.
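
For example, the assertion would catch a (hypothetical) buggy
handler of this shape:

	dotraplinkage void do_buggy_trap(struct pt_regs *regs,
					 long error_code)
	{
		/*
		 * BUG: notify_die() runs before exception_enter(),
		 * so RCU may still think this CPU is quiescent.
		 */
		notify_die(DIE_TRAP, "buggy", regs, error_code,
			   X86_TRAP_UD, SIGILL);
	}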

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/notifier.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index ae9fc7cc360e..980e4330fb59 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -544,6 +544,8 @@ int notrace notify_die(enum die_val val, const char *str,
 		.signr	= sig,
 
 	};
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "notify_die called but RCU thinks we're quiescent");
 	return atomic_notifier_call_chain(&die_chain, val, &args);
 }
 NOKPROBE_SYMBOL(notify_die);
-- 
2.4.3


* [RFC/INCOMPLETE 03/13] x86: Move C entry and exit code to arch/x86/entry/common.c
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to ptrace or
signal handling.  Move them together in a new file.

This change just moves code around.  It doesn't change anything.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/Makefile       |   1 +
 arch/x86/entry/common.c       | 253 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/signal.h |   1 +
 arch/x86/kernel/ptrace.c      | 202 +--------------------------------
 arch/x86/kernel/signal.c      |  28 +----
 5 files changed, 257 insertions(+), 228 deletions(-)
 create mode 100644 arch/x86/entry/common.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7a144971db79..bd55dedd7614 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,6 +2,7 @@
 # Makefile for the x86 low level entry code
 #
 obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y				+= common.o
 
 obj-y				+= vdso/
 obj-y				+= vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..348465473e55
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,253 @@
+/*
+ * common.c - C code for kernel entry and exit
+ * Copyright (c) 2015 Andrew Lutomirski
+ * GPL v2
+ *
+ * Based on asm and ptrace code by many authors.  The code here originated
+ * in ptrace.c and signal.c.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/errno.h>
+#include <linux/ptrace.h>
+#include <linux/tracehook.h>
+#include <linux/audit.h>
+#include <linux/seccomp.h>
+#include <linux/signal.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+
+#include <asm/desc.h>
+#include <asm/traps.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
+/*
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:			resume the syscall
+ * 1:			go to phase 2; no seccomp phase 2 needed
+ * anything else:	go to phase 2; pass return value to seccomp
+ */
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	/*
+	 * If TIF_NOHZ is set, we are required to call user_exit() before
+	 * doing anything that could touch RCU.
+	 */
+	if (work & _TIF_NOHZ) {
+		user_exit();
+		work &= ~_TIF_NOHZ;
+	}
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
+{
+	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	/*
+	 * If we stepped into a sysenter/syscall insn, it trapped in
+	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
+	 * If user-mode had set TF itself, then it's still clear from
+	 * do_debug() and we need to set it again to restore the user
+	 * state.  If we entered on the slow path, TF was already set.
+	 */
+	if (work & _TIF_SINGLESTEP)
+		regs->flags |= X86_EFLAGS_TF;
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
+		/* seccomp failures shouldn't expose any additional code. */
+		return -1;
+	}
+#endif
+
+	if (unlikely(work & _TIF_SYSCALL_EMU))
+		ret = -1L;
+
+	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
+	    tracehook_report_syscall_entry(regs))
+		ret = -1L;
+
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+		trace_sys_enter(regs, regs->orig_ax);
+
+	do_audit_syscall_entry(regs, arch);
+
+	return ret ?: regs->orig_ax;
+}
+
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
+void syscall_trace_leave(struct pt_regs *regs)
+{
+	bool step;
+
+	/*
+	 * We may come here right after calling schedule_user()
+	 * or do_notify_resume(), in which case we can be in RCU
+	 * user mode.
+	 */
+	user_exit();
+
+	audit_syscall_exit(regs);
+
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+		trace_sys_exit(regs, regs->ax);
+
+	/*
+	 * If TIF_SYSCALL_EMU is set, we only get here because of
+	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+	 * We already reported this syscall instruction in
+	 * syscall_trace_enter().
+	 */
+	step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
+			!test_thread_flag(TIF_SYSCALL_EMU);
+	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
+		tracehook_report_syscall_exit(regs, step);
+
+	user_enter();
+}
+
+/*
+ * notification of userspace execution resumption
+ * - triggered by the TIF_WORK_MASK flags
+ */
+__visible void
+do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+{
+	user_exit();
+
+	if (thread_info_flags & _TIF_UPROBE)
+		uprobe_notify_resume(regs);
+
+	/* deal with pending signal delivery */
+	if (thread_info_flags & _TIF_SIGPENDING)
+		do_signal(regs);
+
+	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+		clear_thread_flag(TIF_NOTIFY_RESUME);
+		tracehook_notify_resume(regs);
+	}
+	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
+	user_enter();
+}
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 31eab867e6d3..b42408bcf6b5 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -30,6 +30,7 @@ typedef sigset_t compat_sigset_t;
 #endif /* __ASSEMBLY__ */
 #include <uapi/asm/signal.h>
 #ifndef __ASSEMBLY__
+extern void do_signal(struct pt_regs *regs);
 extern void do_notify_resume(struct pt_regs *, void *, __u32);
 
 #define __ARCH_HAS_SA_RESTORER
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..d9417430a4b0 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,12 +37,10 @@
 #include <asm/proto.h>
 #include <asm/hw_breakpoint.h>
 #include <asm/traps.h>
+#include <asm/syscall.h>
 
 #include "tls.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 enum x86_regset {
 	REGSET_GENERAL,
 	REGSET_FP,
@@ -1434,201 +1432,3 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	/* Send us the fake SIGTRAP */
 	force_sig_info(SIGTRAP, &info, tsk);
 }
-
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
-	if (arch == AUDIT_ARCH_X86_64) {
-		audit_syscall_entry(regs->orig_ax, regs->di,
-				    regs->si, regs->dx, regs->r10);
-	} else
-#endif
-	{
-		audit_syscall_entry(regs->orig_ax, regs->bx,
-				    regs->cx, regs->dx, regs->si);
-	}
-}
-
-/*
- * We can return 0 to resume the syscall or anything else to go to phase
- * 2.  If we resume the syscall, we need to put something appropriate in
- * regs->orig_ax.
- *
- * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
- * are fully functional.
- *
- * For phase 2's benefit, our return value is:
- * 0:			resume the syscall
- * 1:			go to phase 2; no seccomp phase 2 needed
- * anything else:	go to phase 2; pass return value to seccomp
- */
-unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
-{
-	unsigned long ret = 0;
-	u32 work;
-
-	BUG_ON(regs != task_pt_regs(current));
-
-	work = ACCESS_ONCE(current_thread_info()->flags) &
-		_TIF_WORK_SYSCALL_ENTRY;
-
-	/*
-	 * If TIF_NOHZ is set, we are required to call user_exit() before
-	 * doing anything that could touch RCU.
-	 */
-	if (work & _TIF_NOHZ) {
-		user_exit();
-		work &= ~_TIF_NOHZ;
-	}
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Do seccomp first -- it should minimize exposure of other
-	 * code, and keeping seccomp fast is probably more valuable
-	 * than the rest of this.
-	 */
-	if (work & _TIF_SECCOMP) {
-		struct seccomp_data sd;
-
-		sd.arch = arch;
-		sd.nr = regs->orig_ax;
-		sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
-		if (arch == AUDIT_ARCH_X86_64) {
-			sd.args[0] = regs->di;
-			sd.args[1] = regs->si;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->r10;
-			sd.args[4] = regs->r8;
-			sd.args[5] = regs->r9;
-		} else
-#endif
-		{
-			sd.args[0] = regs->bx;
-			sd.args[1] = regs->cx;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->si;
-			sd.args[4] = regs->di;
-			sd.args[5] = regs->bp;
-		}
-
-		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
-		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
-
-		ret = seccomp_phase1(&sd);
-		if (ret == SECCOMP_PHASE1_SKIP) {
-			regs->orig_ax = -1;
-			ret = 0;
-		} else if (ret != SECCOMP_PHASE1_OK) {
-			return ret;  /* Go directly to phase 2 */
-		}
-
-		work &= ~_TIF_SECCOMP;
-	}
-#endif
-
-	/* Do our best to finish without phase 2. */
-	if (work == 0)
-		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
-
-#ifdef CONFIG_AUDITSYSCALL
-	if (work == _TIF_SYSCALL_AUDIT) {
-		/*
-		 * If there is no more work to be done except auditing,
-		 * then audit in phase 1.  Phase 2 always audits, so, if
-		 * we audit here, then we can't go on to phase 2.
-		 */
-		do_audit_syscall_entry(regs, arch);
-		return 0;
-	}
-#endif
-
-	return 1;  /* Something is enabled that we can't handle in phase 1 */
-}
-
-/* Returns the syscall nr to run (which should match regs->orig_ax). */
-long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
-				unsigned long phase1_result)
-{
-	long ret = 0;
-	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
-		_TIF_WORK_SYSCALL_ENTRY;
-
-	BUG_ON(regs != task_pt_regs(current));
-
-	/*
-	 * If we stepped into a sysenter/syscall insn, it trapped in
-	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
-	 * If user-mode had set TF itself, then it's still clear from
-	 * do_debug() and we need to set it again to restore the user
-	 * state.  If we entered on the slow path, TF was already set.
-	 */
-	if (work & _TIF_SINGLESTEP)
-		regs->flags |= X86_EFLAGS_TF;
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Call seccomp_phase2 before running the other hooks so that
-	 * they can see any changes made by a seccomp tracer.
-	 */
-	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
-		/* seccomp failures shouldn't expose any additional code. */
-		return -1;
-	}
-#endif
-
-	if (unlikely(work & _TIF_SYSCALL_EMU))
-		ret = -1L;
-
-	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
-	    tracehook_report_syscall_entry(regs))
-		ret = -1L;
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->orig_ax);
-
-	do_audit_syscall_entry(regs, arch);
-
-	return ret ?: regs->orig_ax;
-}
-
-long syscall_trace_enter(struct pt_regs *regs)
-{
-	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
-	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
-
-	if (phase1_result == 0)
-		return regs->orig_ax;
-	else
-		return syscall_trace_enter_phase2(regs, arch, phase1_result);
-}
-
-void syscall_trace_leave(struct pt_regs *regs)
-{
-	bool step;
-
-	/*
-	 * We may come here right after calling schedule_user()
-	 * or do_notify_resume(), in which case we can be in RCU
-	 * user mode.
-	 */
-	user_exit();
-
-	audit_syscall_exit(regs);
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->ax);
-
-	/*
-	 * If TIF_SYSCALL_EMU is set, we only get here because of
-	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
-	 * We already reported this syscall instruction in
-	 * syscall_trace_enter().
-	 */
-	step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
-			!test_thread_flag(TIF_SYSCALL_EMU);
-	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
-		tracehook_report_syscall_exit(regs, step);
-
-	user_enter();
-}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 1ea14fd53933..c0ad42a640b9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -683,7 +683,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
  * mistake.
  */
-static void do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 
@@ -718,32 +718,6 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
-	user_exit();
-
-	if (thread_info_flags & _TIF_UPROBE)
-		uprobe_notify_resume(regs);
-
-	/* deal with pending signal delivery */
-	if (thread_info_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
-
-	user_enter();
-}
-
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
 {
 	struct task_struct *me = current;
-- 
2.4.3


* [RFC/INCOMPLETE 04/13] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

Other than the super-atomic exception entries, all exception entries
are supposed to switch our context tracking state to CONTEXT_KERNEL.
Assert that they do.  These assertions appear trivial at this point,
as exception_enter is the function responsible for switching
context, but I'm planning on reworking x86's exception context
tracking, and these assertions will help make sure that all of this
code keeps working.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index de379366f6d1..786d359fe824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -291,6 +291,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
 	enum ctx_state prev_state = exception_enter();
 	siginfo_t info;
 
+	context_tracking_assert_state(CONTEXT_KERNEL);
+
 	if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
 			NOTIFY_STOP) {
 		conditional_sti(regs);
@@ -377,6 +379,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	siginfo_t *info;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	if (notify_die(DIE_TRAP, "bounds", regs, error_code,
 			X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
 		goto exit;
@@ -458,6 +461,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	conditional_sti(regs);
 
 	if (v8086_mode(regs)) {
@@ -515,6 +519,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		return;
 
 	prev_state = ist_enter(regs);
+	context_tracking_assert_state(CONTEXT_KERNEL);
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
 				SIGTRAP) == NOTIFY_STOP)
@@ -794,6 +799,7 @@ dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_MF);
 	exception_exit(prev_state);
 }
@@ -804,6 +810,7 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_XF);
 	exception_exit(prev_state);
 }
@@ -862,6 +869,7 @@ do_device_not_available(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	BUG_ON(use_eager_fpu());
 
 #ifdef CONFIG_MATH_EMULATION
@@ -891,6 +899,7 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	context_tracking_assert_state(CONTEXT_KERNEL);
 	local_irq_enable();
 
 	info.si_signo = SIGILL;
-- 
2.4.3


* [RFC/INCOMPLETE 05/13] x86/entry: Add enter_from_user_mode and use it in syscalls
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

Changing the x86 context tracking hooks is dangerous because there
are no good checks that we track our context correctly.  Add a
helper to check that we're actually in CONTEXT_USER when we enter
from user mode and wire it up for syscall entries.

Subsequent patches will wire this up for all non-NMI entries as
well.  NMIs are their own special beast and cannot currently switch
overall context tracking state.  Instead, they have their own
special RCU hooks (rcu_nmi_enter() and rcu_nmi_exit()).

This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
of indirection).  Eventually, we should fix up the core context
tracking code to supply a function that does what we want (and can
be much simpler than user_exit), which will enable us to get rid of
the extra call.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 348465473e55..54af0df50080 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,15 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
+#ifdef CONFIG_CONTEXT_TRACKING
+/* Called on entry from user mode with IRQs off. */
+asmlinkage __visible void enter_from_user_mode(void)
+{
+	context_tracking_assert_state(CONTEXT_USER);
+	user_exit();
+}
+#endif
+
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
 {
 #ifdef CONFIG_X86_64
@@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	work = ACCESS_ONCE(current_thread_info()->flags) &
 		_TIF_WORK_SYSCALL_ENTRY;
 
+#ifdef CONFIG_CONTEXT_TRACKING
 	/*
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
 	if (work & _TIF_NOHZ) {
-		user_exit();
+		enter_from_user_mode();
 		work &= ~_TIF_NOHZ;
 	}
+#endif
 
 #ifdef CONFIG_SECCOMP
 	/*
-- 
2.4.3


* [RFC/INCOMPLETE 06/13] x86/entry: Add new, comprehensible entry and exit hooks
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

The current entry and exit code is incomprehensible, appears to work
primarily by luck, and is very difficult to improve incrementally.  Add
new code in preparation for simply deleting the old code.

prepare_exit_to_usermode is a new function that will handle all slow
path exits to user mode.  It is called with IRQs disabled and it
leaves us in a state in which it is safe to immediately return to
user mode.  IRQs must not be re-enabled at any point after
prepare_exit_to_usermode returns and user mode is actually entered.
(We can, of course, fail to enter user mode and treat that failure
as a fresh entry to kernel mode.)  All callers of do_notify_resume
will be migrated to call prepare_exit_to_usermode instead;
prepare_exit_to_usermode needs to do everything that
do_notify_resume does, but it also takes care of scheduling and
context tracking.  Unlike do_notify_resume, it does not need to be
called in a loop.

syscall_return_slowpath is exactly what it sounds like.  It will be
called on any syscall exit slow path.  It will replace
syscall_trace_leave, and it calls prepare_exit_to_usermode on the way
out.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 54af0df50080..c2ad6b60f4c9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 		return syscall_trace_enter_phase2(regs, arch, phase1_result);
 }
 
+/* Deprecated. */
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
@@ -237,8 +238,115 @@ void syscall_trace_leave(struct pt_regs *regs)
 	user_enter();
 }
 
+static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
+{
+	unsigned long top_of_stack =
+		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
+	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
+}
+
+/* Called with IRQs disabled. */
+asmlinkage __visible void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+	if (WARN_ON(!irqs_disabled()))
+		local_irq_disable();
+
+	/*
+	 * In order to return to user mode, we need to have IRQs off with
+	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
+	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
+	 * can be set at any time on preemptable kernels if we have IRQs on,
+	 * so we need to loop.  Disabling preemption wouldn't help: doing the
+	 * work to clear some of the flags can sleep.
+	 */
+	while (true) {
+		u32 cached_flags =
+			READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+		if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+				      _TIF_UPROBE | _TIF_NEED_RESCHED)))
+			break;
+
+		/* We have work to do. */
+		local_irq_enable();
+
+		if (cached_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (cached_flags & _TIF_UPROBE)
+			uprobe_notify_resume(regs);
+
+		/* deal with pending signal delivery */
+		if (cached_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (cached_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+			fire_user_return_notifiers();
+
+		/* Disable IRQs and retry */
+		local_irq_disable();
+	}
+
+	user_enter();
+}
+
+/*
+ * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+asmlinkage __visible void syscall_return_slowpath(struct pt_regs *regs)
+{
+	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	u32 cached_flags = READ_ONCE(ti->flags);
+	bool step;
+
+	context_tracking_assert_state(CONTEXT_KERNEL);
+
+	if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
+		 regs->orig_ax))
+		local_irq_enable();
+
+	/*
+	 * First do one-time work.  If these work items are enabled, we
+	 * want to run them exactly once per syscall exit with IRQs on.
+	 */
+	if (cached_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |
+			    _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)) {
+		audit_syscall_exit(regs);
+
+		if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+			trace_sys_exit(regs, regs->ax);
+
+		/*
+		 * If TIF_SYSCALL_EMU is set, we only get here because of
+		 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+		 * We already reported this syscall instruction in
+		 * syscall_trace_enter().
+		 */
+		step = unlikely(
+			(cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
+			== _TIF_SINGLESTEP);
+		if (step || cached_flags & _TIF_SYSCALL_TRACE)
+			tracehook_report_syscall_exit(regs, step);
+	}
+
+	/*
+	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
+	 * returning to user mode.
+	 */
+	ti->status &= ~TS_COMPAT;
+
+	local_irq_disable();
+	prepare_exit_to_usermode(regs);
+}
+
 /*
- * notification of userspace execution resumption
+ * Deprecated notification of userspace execution resumption
  * - triggered by the TIF_WORK_MASK flags
  */
 __visible void
-- 
2.4.3


* [RFC/INCOMPLETE 07/13] x86/entry/64: Really create an error-entry-from-usermode code path
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong --
IRET faults would skip the code path that was intended to execute on
all error entries from user mode.  Fix it up.

This does not fix a bug, but we'll need it, and it slightly shrinks
the code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c4302df1..33acc3dcc281 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1145,7 +1145,11 @@ ENTRY(error_entry)
 	testb	$3, CS+8(%rsp)
 	jz	error_kernelspace
 
-	/* We entered from user mode */
+error_entry_from_usermode:
+	/*
+	 * We entered from user mode or we're pretending to have entered
+	 * from user mode due to an IRET fault.
+	 */
 	SWAPGS
 
 error_entry_done:
@@ -1174,8 +1178,7 @@ error_kernelspace:
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * gs_change's error handler with kernel gsbase.
 	 */
-	SWAPGS
-	jmp	error_entry_done
+	jmp	error_entry_from_usermode
 
 bstep_iret:
 	/* Fix truncated RIP */
-- 
2.4.3


* [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

This is separate for ease of bisection.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
 1 file changed, 7 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 33acc3dcc281..a5044d7a9d43 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
 	 */
 	USERGS_SYSRET64
 
+GLOBAL(int_ret_from_sys_call_irqs_off)
+	TRACE_IRQS_ON
+	ENABLE_INTERRUPTS(CLBR_NONE)
+	jmp int_ret_from_sys_call
+
 	/* Do syscall entry tracing */
 tracesys:
 	movq	%rsp, %rdi
@@ -272,69 +277,10 @@ tracesys_phase2:
  * Has correct iret frame.
  */
 GLOBAL(int_ret_from_sys_call)
-	DISABLE_INTERRUPTS(CLBR_NONE)
-int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
-	TRACE_IRQS_OFF
-	movl	$_TIF_ALLWORK_MASK, %edi
-	/* edi:	mask to check */
-GLOBAL(int_with_check)
-	LOCKDEP_SYS_EXIT_IRQ
-	GET_THREAD_INFO(%rcx)
-	movl	TI_flags(%rcx), %edx
-	andl	%edi, %edx
-	jnz	int_careful
-	andl	$~TS_COMPAT, TI_status(%rcx)
-	jmp	syscall_return
-
-	/*
-	 * Either reschedule or signal or syscall exit tracking needed.
-	 * First do a reschedule test.
-	 * edx:	work, edi: workmask
-	 */
-int_careful:
-	bt	$TIF_NEED_RESCHED, %edx
-	jnc	int_very_careful
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq	%rdi
-	SCHEDULE_USER
-	popq	%rdi
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	int_with_check
-
-	/* handle signals and tracing -- both require a full pt_regs */
-int_very_careful:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
-	/* Check for syscall exit trace */
-	testl	$_TIF_WORK_SYSCALL_EXIT, %edx
-	jz	int_signal
-	pushq	%rdi
-	leaq	8(%rsp), %rdi			/* &ptregs -> arg1 */
-	call	syscall_trace_leave
-	popq	%rdi
-	andl	$~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
-	jmp	int_restore_rest
-
-int_signal:
-	testl	$_TIF_DO_NOTIFY_MASK, %edx
-	jz	1f
-	movq	%rsp, %rdi			/* &ptregs -> arg1 */
-	xorl	%esi, %esi			/* oldset -> arg2 */
-	call	do_notify_resume
-1:	movl	$_TIF_WORK_MASK, %edi
-int_restore_rest:
+	movq	%rsp, %rdi
+	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	RESTORE_EXTRA_REGS
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	int_with_check
-
-syscall_return:
-	/* The IRETQ could re-enable interrupts: */
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_IRETQ
 
 	/*
 	 * Try to use SYSRET instead of IRET if we're returning to
-- 
2.4.3


* [RFC/INCOMPLETE 09/13] x86/entry/compat: Migrate compat syscalls to new exit hooks
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

This is separate for ease of bisection.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index bb187a6a877c..415afa038edf 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -209,10 +209,10 @@ sysexit_from_sys_call:
 	.endm
 
 	.macro auditsys_exit exit
-	testl	$(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
-	jnz	ia32_ret_from_sys_call
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	jnz ia32_ret_from_sys_call
 	movl	%eax, %esi		/* second arg, syscall return value */
 	cmpl	$-MAX_ERRNO, %eax	/* is it an error ? */
 	jbe	1f
@@ -227,11 +227,10 @@ sysexit_from_sys_call:
 	testl	%edi, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jz	\exit
 	xorl	%eax, %eax		/* Do not leak kernel information */
-	movq	%rax, R11(%rsp)
+	jmp	int_ret_from_sys_call_irqs_off
 	movq	%rax, R10(%rsp)
 	movq	%rax, R9(%rsp)
 	movq	%rax, R8(%rsp)
-	jmp	int_with_check
 	.endm
 
 sysenter_auditsys:
-- 
2.4.3


* [RFC/INCOMPLETE 10/13] x86/asm/entry/64: Save all regs on interrupt entry
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

To prepare for the big rewrite of the error and interrupt exit
paths, we will need pt_regs completely filled in.  It's already
completely filled in when error_exit runs, so rearrange interrupt
handling to match it.  This will slow down interrupt handling very
slightly (eight instructions), but the simplification it enables
will be more than worth it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a5044d7a9d43..43bf5762443c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -501,21 +501,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
-	/*
-	 * Since nothing in interrupt handling code touches r12...r15 members
-	 * of "struct pt_regs", and since interrupts can nest, we can save
-	 * four stack slots and simultaneously provide
-	 * an unwind-friendly stack layout by saving "truncated" pt_regs
-	 * exactly up to rbp slot, without these members.
-	 */
-	ALLOC_PT_GPREGS_ON_STACK -RBP
-	SAVE_C_REGS -RBP
-	/* this goes to 0(%rsp) for unwinder, not for saving the value: */
-	SAVE_EXTRA_REGS_RBP -RBP
+	ALLOC_PT_GPREGS_ON_STACK
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
 
-	leaq	-RBP(%rsp), %rdi		/* arg1 for \func (pointer to pt_regs) */
+	movq	%rsp,%rdi	/* arg1 for \func (pointer to pt_regs) */
 
-	testb	$3, CS-RBP(%rsp)
+	testb	$3, CS(%rsp)
 	jz	1f
 	SWAPGS
 1:
@@ -552,9 +544,7 @@ ret_from_intr:
 	decl	PER_CPU_VAR(irq_count)
 
 	/* Restore saved previous stack */
-	popq	%rsi
-	/* return code expects complete pt_regs - adjust rsp accordingly: */
-	leaq	-RBP(%rsi), %rsp
+	popq	%rsp
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
@@ -579,7 +569,7 @@ retint_swapgs:					/* return to user-space */
 	TRACE_IRQS_IRETQ
 
 	SWAPGS
-	jmp	restore_c_regs_and_iret
+	jmp	restore_regs_and_iret
 
 /* Returning to kernel space */
 retint_kernel:
@@ -603,6 +593,8 @@ retint_kernel:
  * At this label, code paths which return to kernel and to user,
  * which come from interrupts/exception and from syscalls, merge.
  */
+restore_regs_and_iret:
+	RESTORE_EXTRA_REGS
 restore_c_regs_and_iret:
 	RESTORE_C_REGS
 	REMOVE_PT_GPREGS_FROM_STACK 8
@@ -673,12 +665,10 @@ retint_signal:
 	jz	retint_swapgs
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_EXTRA_REGS
 	movq	$-1, ORIG_RAX(%rsp)
 	xorl	%esi, %esi			/* oldset */
 	movq	%rsp, %rdi			/* &pt_regs */
 	call	do_notify_resume
-	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
@@ -1158,7 +1148,6 @@ END(error_entry)
  */
 ENTRY(error_exit)
 	movl	%ebx, %eax
-	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	testl	%eax, %eax
-- 
2.4.3


* [RFC/INCOMPLETE 11/13] x86/asm/entry/64: Simplify irq stack pt_regs handling
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

There's no need for both rsi and rdi to point to the original stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43bf5762443c..ab8cbf602d19 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -505,8 +505,6 @@ END(irq_entries_start)
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 
-	movq	%rsp,%rdi	/* arg1 for \func (pointer to pt_regs) */
-
 	testb	$3, CS(%rsp)
 	jz	1f
 	SWAPGS
@@ -518,14 +516,14 @@ END(irq_entries_start)
 	 * a little cheaper to use a separate counter in the PDA (short of
 	 * moving irq_enter into assembly, which would be too much work)
 	 */
-	movq	%rsp, %rsi
+	movq	%rsp, %rdi
 	incl	PER_CPU_VAR(irq_count)
 	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rsi
+	pushq	%rdi
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
-	call	\func
+	call	\func	/* rdi points to pt_regs */
 	.endm
 
 	/*
-- 
2.4.3


* [RFC/INCOMPLETE 12/13] x86/asm/entry/64: Migrate error and interrupt exit work to C
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 63 +++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ab8cbf602d19..9ae8b8ab91fa 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -507,7 +507,16 @@ END(irq_entries_start)
 
 	testb	$3, CS(%rsp)
 	jz	1f
+
+	/*
+	 * IRQ from user mode.  Switch to kernel gsbase and inform context
+	 * tracking that we're in kernel mode.
+	 */
 	SWAPGS
+#ifdef CONFIG_CONTEXT_TRACKING
+	call enter_from_user_mode
+#endif
+
 1:
 	/*
 	 * Save previous stack pointer, optionally switch to interrupt stack.
@@ -546,26 +555,13 @@ ret_from_intr:
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
-	/* Interrupt came from user space */
-retint_user:
-	GET_THREAD_INFO(%rcx)
 
-	/* %rcx: thread info. Interrupts are off. */
-retint_with_reschedule:
-	movl	$_TIF_WORK_MASK, %edi
-retint_check:
+	/* Interrupt came from user space */
 	LOCKDEP_SYS_EXIT_IRQ
-	movl	TI_flags(%rcx), %edx
-	andl	%edi, %edx
-	jnz	retint_careful
-
-retint_swapgs:					/* return to user-space */
-	/*
-	 * The iretq could re-enable interrupts:
-	 */
-	DISABLE_INTERRUPTS(CLBR_ANY)
+retint_user:
+	mov	%rsp,%rdi
+	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
-
 	SWAPGS
 	jmp	restore_regs_and_iret
 
@@ -643,35 +639,6 @@ native_irq_return_ldt:
 	popq	%rax
 	jmp	native_irq_return_iret
 #endif
-
-	/* edi: workmask, edx: work */
-retint_careful:
-	bt	$TIF_NEED_RESCHED, %edx
-	jnc	retint_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq	%rdi
-	SCHEDULE_USER
-	popq	%rdi
-	GET_THREAD_INFO(%rcx)
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	retint_check
-
-retint_signal:
-	testl	$_TIF_DO_NOTIFY_MASK, %edx
-	jz	retint_swapgs
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	movq	$-1, ORIG_RAX(%rsp)
-	xorl	%esi, %esi			/* oldset */
-	movq	%rsp, %rdi			/* &pt_regs */
-	call	do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	GET_THREAD_INFO(%rcx)
-	jmp	retint_with_reschedule
-
 END(common_interrupt)
 
 /*
@@ -1086,6 +1053,10 @@ error_entry_from_usermode:
 	 */
 	SWAPGS
 
+#ifdef CONFIG_CONTEXT_TRACKING
+	call enter_from_user_mode
+#endif
+
 error_entry_done:
 	TRACE_IRQS_OFF
 	ret
-- 
2.4.3


* [RFC/INCOMPLETE 13/13] x86/entry: Remove SCHEDULE_USER and asm/context_tracking.h
From: Andy Lutomirski @ 2015-06-16 20:16 UTC
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Andy Lutomirski

SCHEDULE_USER is no longer used, and asm/context_tracking.h
contained nothing else.  Remove the header entirely.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S               |  1 -
 arch/x86/include/asm/context_tracking.h | 10 ----------
 2 files changed, 11 deletions(-)
 delete mode 100644 arch/x86/include/asm/context_tracking.h

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ae8b8ab91fa..25cc9c36ca59 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -33,7 +33,6 @@
 #include <asm/paravirt.h>
 #include <asm/percpu.h>
 #include <asm/asm.h>
-#include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
 #include <linux/err.h>
diff --git a/arch/x86/include/asm/context_tracking.h b/arch/x86/include/asm/context_tracking.h
deleted file mode 100644
index 1fe49704b146..000000000000
--- a/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
-- 
2.4.3


* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
From: Ingo Molnar @ 2015-06-17  9:41 UTC
  To: Andy Lutomirski, Paul E. McKenney
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst


* Andy Lutomirski <luto@kernel.org> wrote:

> This will let us sprinkle sanity checks around the kernel without
> making too much of a mess.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/context_tracking.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 2821838256b4..0fbea4b152e1 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>  	if (context_tracking_is_enabled())
>  		__context_tracking_task_switch(prev, next);
>  }
> +
> +static inline void context_tracking_assert_state(enum ctx_state state)
> +{
> +	rcu_lockdep_assert(!context_tracking_is_enabled() ||
> +			   this_cpu_read(context_tracking.state) == state,
> +			   "context tracking state was wrong");
> +}

Please don't introduce assert() style debug check interfaces!

(And RCU should be fixed too I suspect.)

They are absolutely horrible on the brain when mixed with WARN_ON() interfaces, 
which are the dominant runtime check interface in the kernel.

Instead make it something like:

  #define ct_state() (this_cpu_read(context_tracking.state))

  #define CT_WARN_ON(cond) \
	WARN_ON(context_tracking_is_enabled() && (cond))

and then the debug checks can be written as:

	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

This is IMHO _far_ more readable than:

	context_tracking_assert_state(CONTEXT_KERNEL);

ok?

(Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
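
A rough sketch of the CONFIG guards for these (the empty stub for the
!CONFIG_CONTEXT_TRACKING case is my assumption, untested):

  #ifdef CONFIG_CONTEXT_TRACKING
  # define ct_state()		this_cpu_read(context_tracking.state)
  # define CT_WARN_ON(cond)	WARN_ON(context_tracking_is_enabled() && (cond))
  #else
  # define CT_WARN_ON(cond)	do { } while (0)	/* 'cond' never evaluated */
  #endif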

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-16 20:16 [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (12 preceding siblings ...)
  2015-06-16 20:16 ` [RFC/INCOMPLETE 13/13] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h Andy Lutomirski
@ 2015-06-17  9:48 ` Ingo Molnar
  2015-06-17 10:13   ` Richard Weinberger
  2015-06-17 15:16   ` Andy Lutomirski
  2015-06-17 10:32 ` Ingo Molnar
  14 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17  9:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst, Linus Torvalds, Denys Vlasenko


* Andy Lutomirski <luto@kernel.org> wrote:

> This is incomplete, but it's finally good enough that I think it's
> time to get other opinions on it.  It is a complete rewrite of the
> slow path code that handles exits to user mode.

Modulo the small comments I made about the debug checks interface plus naming 
details the structure and intention of this series gives me warm fuzzy feelings.

> The exit-to-usermode code is copied in several places and is written in a nasty 
> combination of asm and C.  It's not at all clear what it's supposed to do, and 
> the way it's structured makes it very hard to work with.  For example, it's not 
> even clear why syscall exit hooks are called only once per syscall right now.  
> (It seems to be a side effect of the way that rdi and rdx are handled in the asm 
> loop, and it seems reliable, but it's still pointlessly complicated.)  The 
> existing code also makes context tracking overly complicated and hard to 
> understand.  Finally, it's nearly impossible for anyone to change what happens 
> on exit to usermode, since the existing code is so fragile.

Amen.

> I tried to clean it up incrementally, but I decided it was too hard. Instead, 
> this series just replaces the code.  It seems to work.

Any known bugs beyond UML build breakage?

> Context tracking in particular works very differently now.  The low-level entry 
> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL.  The exit 
> code does the reverse.  There is no need to track what CONTEXT_XYZ state we came 
> from, because we already know.  Similarly, SCHEDULE_USER is gone, since we can 
> reschedule if needed by simply calling schedule() from C code.
> 
> The main things that are missing are that I haven't done the 32-bit parts 
> (anyone want to help?) and therefore I haven't deleted the old C code.  I also 
> think this may break UML for trivial reasons.
> 
> Because I haven't converted the 32-bit code yet, all of the now-unnecessary 
> calls to exception_enter are still present in traps.c.
> 
> IRQ context tracking is still duplicated.  We should probably clean it up by 
> changing the core code to supply something like 
> irq_enter_we_are_already_in_context_kernel.
> 
> Thoughts?

So assuming you fix the UML build I'm inclined to go for it, even in this 
incomplete form, to increase testing coverage.

Doing that will also decrease ongoing merge friction between your work and other 
entry code cleanups ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  2015-06-16 20:16 ` [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks Andy Lutomirski
@ 2015-06-17 10:00   ` Ingo Molnar
  2015-06-17 10:02     ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17 10:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst, Linus Torvalds


* Andy Lutomirski <luto@kernel.org> wrote:

> This is separate for ease of bisection.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
>  1 file changed, 7 insertions(+), 61 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 33acc3dcc281..a5044d7a9d43 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
>  	 */
>  	USERGS_SYSRET64
>  
> +GLOBAL(int_ret_from_sys_call_irqs_off)
> +	TRACE_IRQS_ON
> +	ENABLE_INTERRUPTS(CLBR_NONE)
> +	jmp int_ret_from_sys_call

Any reason why irq state tracking cannot be done in C as well, like the rest of 
the irq state tracking code?

(Btw., context tracking could in theory be merged with irq state tracking, they 
have similar needs.)

One complication would be that we have not saved pt_regs yet:

> @@ -272,69 +277,10 @@ tracesys_phase2:
>   * Has correct iret frame.
>   */
>  GLOBAL(int_ret_from_sys_call)
>  	SAVE_EXTRA_REGS
> +	movq	%rsp, %rdi
> +	call	syscall_return_slowpath	/* returns with IRQs disabled */
>  	RESTORE_EXTRA_REGS

... but we can stipulate that IRQs can be enabled after we've constructed pt_regs. 
The few cycles of constant added latency there isn't an issue - maintainability of 
the code is.

... this would also allow us to get rid of paravirt interface uglies like 
'CLBR_NONE' which has no place in generic entry code.

With such a conversion very little 'high level, complex logic' should be left in 
assembly and as much as possible should be moved to C: that way we can integrate 
it all on the C level and achieve similar speedups as with assembly. A half-done 
conversion will reduce that potential for speedups through integration.

A bit like how we have done it with do_page_fault(): there's very little left at 
the assembly level, yet that has not kept people from micro-optimizing the heck 
out of the low level page fault code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  2015-06-17 10:00   ` Ingo Molnar
@ 2015-06-17 10:02     ` Ingo Molnar
  2015-06-17 14:12       ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17 10:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Andy Lutomirski <luto@kernel.org> wrote:
> 
> > This is separate for ease of bisection.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
> >  1 file changed, 7 insertions(+), 61 deletions(-)
> > 
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 33acc3dcc281..a5044d7a9d43 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
> >  	 */
> >  	USERGS_SYSRET64
> >  
> > +GLOBAL(int_ret_from_sys_call_irqs_off)
> > +	TRACE_IRQS_ON
> > +	ENABLE_INTERRUPTS(CLBR_NONE)
> > +	jmp int_ret_from_sys_call
> 
> Any reason why irq state tracking cannot be done in C as well, like the rest of 
> the irq state tracking code?

Never mind, I see you've done exactly that in patch #12.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17  9:48 ` [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Ingo Molnar
@ 2015-06-17 10:13   ` Richard Weinberger
  2015-06-17 11:04     ` Ingo Molnar
  2015-06-17 14:19     ` Andy Lutomirski
  2015-06-17 15:16   ` Andy Lutomirski
  1 sibling, 2 replies; 45+ messages in thread
From: Richard Weinberger @ 2015-06-17 10:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, LKML, Frédéric Weisbecker,
	Rik van Riel, Oleg Nesterov, Denys Vlasenko, Borislav Petkov,
	Kees Cook, Brian Gerst, Linus Torvalds, Denys Vlasenko

On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> This is incomplete, but it's finally good enough that I think it's
>> time to get other opinions on it.  It is a complete rewrite of the
>> slow path code that handles exits to user mode.
>
> Modulo the small comments I made about the debug checks interface plus naming
> details the structure and intention of this series gives me warm fuzzy feelings.
>
>> The exit-to-usermode code is copied in several places and is written in a nasty
>> combination of asm and C.  It's not at all clear what it's supposed to do, and
>> the way it's structured makes it very hard to work with.  For example, it's not
>> even clear why syscall exit hooks are called only once per syscall right now.
>> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
>> loop, and it seems reliable, but it's still pointlessly complicated.)  The
>> existing code also makes context tracking overly complicated and hard to
>> understand.  Finally, it's nearly impossible for anyone to change what happens
>> on exit to usermode, since the existing code is so fragile.
>
> Amen.
>
>> I tried to clean it up incrementally, but I decided it was too hard. Instead,
>> this series just replaces the code.  It seems to work.
>
> Any known bugs beyond UML build breakage?
>
>> Context tracking in particular works very differently now.  The low-level entry
>> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL.  The exit
>> code does the reverse.  There is no need to track what CONTEXT_XYZ state we came
>> from, because we already know.  Similarly, SCHEDULE_USER is gone, since we can
>> reschedule if needed by simply calling schedule() from C code.
>>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>>
>> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
>> calls to exception_enter are still present in traps.c.
>>
>> IRQ context tracking is still duplicated.  We should probably clean it up by
>> changing the core code to supply something like
>> irq_enter_we_are_already_in_context_kernel.
>>
>> Thoughts?
>
> So assuming you fix the UML build I'm inclined to go for it, even in this
> incomplete form, to increase testing coverage.

Andy, can you please share the build breakage you're facing?
I'll happily help you fix it.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-16 20:16 [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (13 preceding siblings ...)
  2015-06-17  9:48 ` [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Ingo Molnar
@ 2015-06-17 10:32 ` Ingo Molnar
  2015-06-17 11:14   ` Ingo Molnar
  2015-06-17 14:23   ` Andy Lutomirski
  14 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17 10:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst


* Andy Lutomirski <luto@kernel.org> wrote:

> The main things that are missing are that I haven't done the 32-bit parts 
> (anyone want to help?) and therefore I haven't deleted the old C code.  I also 
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is how it looks now after your patches:

	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
	cmpq	$__NR_syscall_max, %rax
#else
	andl	$__SYSCALL_MASK, %eax
	cmpl	$__NR_syscall_max, %eax
#endif
	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
	movq	%r10, %rcx
	call	*sys_call_table(, %rax, 8)
	movq	%rax, RAX(%rsp)
1:
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
	LOCKDEP_SYS_EXIT
	/*
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */
	DISABLE_INTERRUPTS(CLBR_NONE)

	/*
	 * We must check ti flags with interrupts (or at least preemption)
	 * off because we must *never* return to userspace without
	 * processing exit work that is enqueued if we're preempted here.
	 * In particular, returning to userspace with any of the one-shot
	 * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
	 * very bad.
	 */
	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	int_ret_from_sys_call_irqs_off	/* Go to the slow path */

Most of that can be done in C.
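
For example, the dispatch part might look something like this in C (a rough
sketch of mine: the function name is invented, the typing of the syscall
table entries is hand-waved, and it's untested):

	__visible void do_syscall_64(struct pt_regs *regs)
	{
		unsigned long nr = regs->orig_ax & __SYSCALL_MASK;

		if (likely(nr <= __NR_syscall_max))
			regs->ax = sys_call_table[nr](regs->di, regs->si,
						      regs->dx, regs->r10,
						      regs->r8, regs->r9);
		/* else: -ENOSYS is already in pt_regs->ax */
	}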

And I think we could also convert the IRET syscall return slow path to C too:

GLOBAL(int_ret_from_sys_call)
	SAVE_EXTRA_REGS
	movq	%rsp, %rdi
	call	syscall_return_slowpath	/* returns with IRQs disabled */
	RESTORE_EXTRA_REGS

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11
	cmpq	%rcx, %r11			/* RCX == RIP */
	jne	opportunistic_sysret_failed

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 */
	.ifne __VIRTUAL_MASK_SHIFT - 47
	.error "virtual address width changed -- SYSRET checks need update"
	.endif

	/* Change top 16 bits to be the sign-extension of 47th bit */
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	opportunistic_sysret_failed

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	opportunistic_sysret_failed

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	opportunistic_sysret_failed

	/*
	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
	 * restoring TF results in a trap from userspace immediately after
	 * SYSRET.  This would cause an infinite loop whenever #DB happens
	 * with register state that satisfies the opportunistic SYSRET
	 * conditions.  For example, single-stepping this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	opportunistic_sysret_failed

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	opportunistic_sysret_failed

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	RESTORE_C_REGS_EXCEPT_RCX_R11
	movq	RSP(%rsp), %rsp
	USERGS_SYSRET64

opportunistic_sysret_failed:
	SWAPGS
	jmp	restore_c_regs_and_iret
END(entry_SYSCALL_64)


Basically there would be a single C function we'd call, which returns a condition 
(or fixes up its return address on the stack directly) to determine between the 
SYSRET and IRET return paths.
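
A rough sketch of how that function's SYSRET suitability check could look,
mirroring the asm above (the name is made up and this is untested; the
canonical-address check hardcodes __VIRTUAL_MASK_SHIFT == 47, like the .ifne
assertion above):

	__visible bool can_sysret(struct pt_regs *regs)
	{
		/* SYSRET requires RCX == RIP and R11 == RFLAGS */
		if (regs->cx != regs->ip || regs->r11 != regs->flags)
			return false;

		/* RIP must be canonical, or SYSRET #GPs in kernel space */
		if (regs->ip != (unsigned long)((long)(regs->ip << 16) >> 16))
			return false;

		/* CS and SS must match what SYSRET restores */
		if (regs->cs != __USER_CS || regs->ss != __USER_DS)
			return false;

		/* SYSRET can't restore RF, and restoring TF would re-trap */
		if (regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF))
			return false;

		return true;
	}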

Moving this to C too has immediate benefits: that way we could easily add 
instrumentation to see how efficient these various return methods are, etc.

I.e. I don't think there's two ways about this: once the entry code moves to the 
domain of C code, we get the best benefits by moving as much of it as possible. 

The only low level bits remaining in assembly will be low level hardware ABI 
details: saving registers and restoring registers to the expected format - no 
'active' code whatsoever.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 10:13   ` Richard Weinberger
@ 2015-06-17 11:04     ` Ingo Molnar
  2015-06-17 14:19     ` Andy Lutomirski
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17 11:04 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Andy Lutomirski, x86, LKML, Frédéric Weisbecker,
	Rik van Riel, Oleg Nesterov, Denys Vlasenko, Borislav Petkov,
	Kees Cook, Brian Gerst, Linus Torvalds, Denys Vlasenko


* Richard Weinberger <richard.weinberger@gmail.com> wrote:

> On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> This is incomplete, but it's finally good enough that I think it's
> >> time to get other opinions on it.  It is a complete rewrite of the
> >> slow path code that handles exits to user mode.
> >
> > Modulo the small comments I made about the debug checks interface plus naming
> > details the structure and intention of this series gives me warm fuzzy feelings.
> >
> >> The exit-to-usermode code is copied in several places and is written in a nasty
> >> combination of asm and C.  It's not at all clear what it's supposed to do, and
> >> the way it's structured makes it very hard to work with.  For example, it's not
> >> even clear why syscall exit hooks are called only once per syscall right now.
> >> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
> >> loop, and it seems reliable, but it's still pointlessly complicated.)  The
> >> existing code also makes context tracking overly complicated and hard to
> >> understand.  Finally, it's nearly impossible for anyone to change what happens
> >> on exit to usermode, since the existing code is so fragile.
> >
> > Amen.
> >
> >> I tried to clean it up incrementally, but I decided it was too hard. Instead,
> >> this series just replaces the code.  It seems to work.
> >
> > Any known bugs beyond UML build breakage?
> >
> >> Context tracking in particular works very differently now.  The low-level entry
> >> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL.  The exit
> >> code does the reverse.  There is no need to track what CONTEXT_XYZ state we came
> >> from, because we already know.  Similarly, SCHEDULE_USER is gone, since we can
> >> reschedule if needed by simply calling schedule() from C code.
> >>
> >> The main things that are missing are that I haven't done the 32-bit parts
> >> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
> >> think this may break UML for trivial reasons.
> >>
> >> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
> >> calls to exception_enter are still present in traps.c.
> >>
> >> IRQ context tracking is still duplicated.  We should probably clean it up by
> >> changing the core code to supply something like
> >> irq_enter_we_are_already_in_context_kernel.
> >>
> >> Thoughts?
> >
> > So assuming you fix the UML build I'm inclined to go for it, even in this
> > incomplete form, to increase testing coverage.
> 
> Andy, can you please share the build breakage you're facing?
> I'll happily help you fixing it.

So they come in the form of:

  ./arch/um/include/shared/kern_util.h:25:12: error: conflicting types for ‘do_signal’

which comes from x86 now also having a do_signal().

The patch below fixes it by harmonizing the UML implementation with the x86 one. 
This improves the UML side a bit, and fixes the build failure.

Thanks,

	Ingo

=========================>
Subject: uml: Fix do_signal() prototype
From: Ingo Molnar <mingo@kernel.org>
Date: Wed Jun 17 12:58:37 CEST 2015

Now that x86 exports its do_signal(), the prototypes clash.

Fix the clash and also improve the code a bit: remove the unnecessary
kern_do_signal() indirection. This allows interrupt_end() to share
the 'regs' parameter calculation.

Also remove the unused return code to match x86.

Minimally build and boot tested.

Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/um/include/shared/kern_util.h |    3 ++-
 arch/um/kernel/process.c           |    6 ++++--
 arch/um/kernel/signal.c            |    8 +-------
 arch/um/kernel/tlb.c               |    2 +-
 arch/um/kernel/trap.c              |    2 +-
 5 files changed, 9 insertions(+), 12 deletions(-)

Index: tip/arch/um/include/shared/kern_util.h
===================================================================
--- tip.orig/arch/um/include/shared/kern_util.h
+++ tip/arch/um/include/shared/kern_util.h
@@ -22,7 +22,8 @@ extern int kmalloc_ok;
 extern unsigned long alloc_stack(int order, int atomic);
 extern void free_stack(unsigned long stack, int order);
 
-extern int do_signal(void);
+struct pt_regs;
+extern void do_signal(struct pt_regs *regs);
 extern void interrupt_end(void);
 extern void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs);
 
Index: tip/arch/um/kernel/process.c
===================================================================
--- tip.orig/arch/um/kernel/process.c
+++ tip/arch/um/kernel/process.c
@@ -90,12 +90,14 @@ void *__switch_to(struct task_struct *fr
 
 void interrupt_end(void)
 {
+	struct pt_regs *regs = &current->thread.regs;
+
 	if (need_resched())
 		schedule();
 	if (test_thread_flag(TIF_SIGPENDING))
-		do_signal();
+		do_signal(regs);
 	if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
-		tracehook_notify_resume(&current->thread.regs);
+		tracehook_notify_resume(regs);
 }
 
 void exit_thread(void)
Index: tip/arch/um/kernel/signal.c
===================================================================
--- tip.orig/arch/um/kernel/signal.c
+++ tip/arch/um/kernel/signal.c
@@ -64,7 +64,7 @@ static void handle_signal(struct ksignal
 	signal_setup_done(err, ksig, singlestep);
 }
 
-static int kern_do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 	int handled_sig = 0;
@@ -110,10 +110,4 @@ static int kern_do_signal(struct pt_regs
 	 */
 	if (!handled_sig)
 		restore_saved_sigmask();
-	return handled_sig;
-}
-
-int do_signal(void)
-{
-	return kern_do_signal(&current->thread.regs);
 }
Index: tip/arch/um/kernel/tlb.c
===================================================================
--- tip.orig/arch/um/kernel/tlb.c
+++ tip/arch/um/kernel/tlb.c
@@ -291,7 +291,7 @@ void fix_range_common(struct mm_struct *
 		/* We are under mmap_sem, release it such that current can terminate */
 		up_write(&current->mm->mmap_sem);
 		force_sig(SIGKILL, current);
-		do_signal();
+		do_signal(&current->thread.regs);
 	}
 }
 
Index: tip/arch/um/kernel/trap.c
===================================================================
--- tip.orig/arch/um/kernel/trap.c
+++ tip/arch/um/kernel/trap.c
@@ -173,7 +173,7 @@ static void bad_segv(struct faultinfo fi
 void fatal_sigsegv(void)
 {
 	force_sigsegv(SIGSEGV, current);
-	do_signal();
+	do_signal(&current->thread.regs);
 	/*
 	 * This is to tell gcc that we're not returning - do_signal
 	 * can, in general, return, but in this case, it's not, since

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 10:32 ` Ingo Molnar
@ 2015-06-17 11:14   ` Ingo Molnar
  2015-06-17 14:23   ` Andy Lutomirski
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-17 11:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst


* Ingo Molnar <mingo@kernel.org> wrote:

> Basically there would be a single C function we'd call, which returns a 
> condition (or fixes up its return address on the stack directly) to determine 
> between the SYSRET and IRET return paths.

This we could do by returning the syscall result in RAX, and the SYSRET/IRET 
choice in RDX - that's the natural return parameter for 128-bit return values in 
the 64-bit C function ABI, and it's clobbered so it's available 'for free'.
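
In C that could be expressed as a small two-word struct return, since per the
x86-64 ABI a 16-byte struct of two integer words comes back in RAX:RDX. A
sketch (the names are placeholders of mine, untested):

	struct syscall_exit {
		unsigned long ax;		/* syscall result, in RAX */
		unsigned long use_sysret;	/* return method, in RDX */
	};

	__visible struct syscall_exit do_syscall_exit_work(struct pt_regs *regs)
	{
		struct syscall_exit ret;

		/* ... the usual exit work and context tracking here ... */

		ret.ax = regs->ax;
		ret.use_sysret = 0;	/* set from the SYSRET suitability checks */
		return ret;
	}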

We could do something similar for the IRQ entry/return code as well: there's no 
reason why IRQ flag tracking has to be maintained in assembly. We could move all 
but the IRQ stack switching code to C.

We can safely flip around the IRQ stack setting with the enter_from_user_mode 
call, so that IRQ stack switching becomes part of the register saving and kernel 
mode preparatory preamble.

This would allow further optimizations in the IRQ code as well: for example we 
could inline enter_from_user_mode() and prepare_exit_to_usermode().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  2015-06-17 10:02     ` Ingo Molnar
@ 2015-06-17 14:12       ` Andy Lutomirski
  2015-06-18 10:17         ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-17 14:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Linus Torvalds

On Wed, Jun 17, 2015 at 3:02 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>>
>> * Andy Lutomirski <luto@kernel.org> wrote:
>>
>> > This is separate for ease of bisection.
>> >
>> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> > ---
>> >  arch/x86/entry/entry_64.S | 68 +++++------------------------------------------
>> >  1 file changed, 7 insertions(+), 61 deletions(-)
>> >
>> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> > index 33acc3dcc281..a5044d7a9d43 100644
>> > --- a/arch/x86/entry/entry_64.S
>> > +++ b/arch/x86/entry/entry_64.S
>> > @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
>> >      */
>> >     USERGS_SYSRET64
>> >
>> > +GLOBAL(int_ret_from_sys_call_irqs_off)
>> > +   TRACE_IRQS_ON
>> > +   ENABLE_INTERRUPTS(CLBR_NONE)
>> > +   jmp int_ret_from_sys_call
>>
>> Any reason why irq state tracking cannot be done in C as well, like the rest of
>> the irq state tracking code?
>
> Never mind, I see you've done exactly that in patch #12.
>

There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such
scattered throughout the asm.  It's plausible that even more of that
could be moved to C.

We could also benchmark and find out how bad it would be if we just
always filled in pt_regs completely in syscalls.  If the performance
hit isn't enough to matter, then we could potentially move the entire
syscall path except pt_regs setup and sysret/iret into three C
functions.
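
Roughly something like this split, I imagine (prototypes only;
syscall_return_slowpath is from this series, the other names are guesses):

	/* entry work: ptrace, seccomp, audit, context tracking entry */
	__visible void syscall_trace_enter(struct pt_regs *regs);

	/* the actual syscall table dispatch */
	__visible void do_syscall_64(struct pt_regs *regs);

	/* exit work; could also report whether SYSRET is safe */
	__visible void syscall_return_slowpath(struct pt_regs *regs);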

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-17  9:41   ` Ingo Molnar
@ 2015-06-17 14:15     ` Andy Lutomirski
  2015-06-18  9:57       ` Ingo Molnar
  2015-06-17 15:27     ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-17 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Paul E. McKenney, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> This will let us sprinkle sanity checks around the kernel without
>> making too much of a mess.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  include/linux/context_tracking.h | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index 2821838256b4..0fbea4b152e1 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>>       if (context_tracking_is_enabled())
>>               __context_tracking_task_switch(prev, next);
>>  }
>> +
>> +static inline void context_tracking_assert_state(enum ctx_state state)
>> +{
>> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> +                        this_cpu_read(context_tracking.state) == state,
>> +                        "context tracking state was wrong");
>> +}
>
> Please don't introduce assert() style debug check interfaces!
>
> (And RCU should be fixed too I suspect.)
>
> They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> which are the dominant runtime check interface in the kernel.
>
> Instead make it something like:
>
>   #define ct_state() (this_cpu_read(context_tracking.state))
>
>   #define CT_WARN_ON(cond) \
>         WARN_ON(context_tracking_is_enabled() && (cond))
>
> and then the debug checks can be written as:
>
>         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>
> This is IMHO _far_ more readable than:
>
>         context_tracking_assert_state(CONTEXT_KERNEL);
>
> ok?
>
> (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)

Hmm, ok I guess.  The part I don't like is having ct_state() at all on
non-context-tracking kernels -- it seems like it's asking for trouble.
We could make CT_WARN_ON not even evaluate its argument if
!CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning
garbage if !context_tracking_is_enabled().

The assert macro avoids all these problems despite being a bit ugly.
It's either a no-op returning void or it does the right thing.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 10:13   ` Richard Weinberger
  2015-06-17 11:04     ` Ingo Molnar
@ 2015-06-17 14:19     ` Andy Lutomirski
  1 sibling, 0 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-17 14:19 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Ingo Molnar, Andy Lutomirski, x86, LKML,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Linus Torvalds, Denys Vlasenko

On Wed, Jun 17, 2015 at 3:13 AM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> On Wed, Jun 17, 2015 at 11:48 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> This is incomplete, but it's finally good enough that I think it's
>>> time to get other opinions on it.  It is a complete rewrite of the
>>> slow path code that handles exits to user mode.
>>
>> Modulo the small comments I made about the debug checks interface plus naming
>> details the structure and intention of this series gives me warm fuzzy feelings.
>>
>>> The exit-to-usermode code is copied in several places and is written in a nasty
>>> combination of asm and C.  It's not at all clear what it's supposed to do, and
>>> the way it's structured makes it very hard to work with.  For example, it's not
>>> even clear why syscall exit hooks are called only once per syscall right now.
>>> (It seems to be a side effect of the way that rdi and rdx are handled in the asm
>>> loop, and it seems reliable, but it's still pointlessly complicated.)  The
>>> existing code also makes context tracking overly complicated and hard to
>>> understand.  Finally, it's nearly impossible for anyone to change what happens
>>> on exit to usermode, since the existing code is so fragile.
>>
>> Amen.
>>
>>> I tried to clean it up incrementally, but I decided it was too hard. Instead,
>>> this series just replaces the code.  It seems to work.
>>
>> Any known bugs beyond UML build breakage?
>>
>>> Context tracking in particular works very differently now.  The low-level entry
>>> code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL.  The exit
>>> code does the reverse.  There is no need to track what CONTEXT_XYZ state we came
>>> from, because we already know.  Similarly, SCHEDULE_USER is gone, since we can
>>> reschedule if needed by simply calling schedule() from C code.
>>>
>>> The main things that are missing are that I haven't done the 32-bit parts
>>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>>> think this may break UML for trivial reasons.
>>>
>>> Because I haven't converted the 32-bit code yet, all of the now-unnecessary
>>> calls to exception_enter are still present in traps.c.
>>>
>>> IRQ context tracking is still duplicated.  We should probably clean it up by
>>> changing the core code to supply something like
>>> irq_enter_we_are_already_in_context_kernel.
>>>
>>> Thoughts?
>>
>> So assuming you fix the UML build I'm inclined to go for it, even in this
>> incomplete form, to increase testing coverage.
>
> Andy, can you please share the build breakage you're facing?
> I'll happily help you fixing it.
>

The do_signal declaration in arch/um/include/shared/kern_util.h
conflicts with the one I added to arch/x86/include/asm/signal.h.  The
latter shouldn't really be included in UML, I think, but I didn't see
how to fix it.  Just renaming one of the functions would resolve the
conflict.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 10:32 ` Ingo Molnar
  2015-06-17 11:14   ` Ingo Molnar
@ 2015-06-17 14:23   ` Andy Lutomirski
  2015-06-18 10:11     ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-17 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is how it looks now after your patches:
>
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
>         cmpq    $__NR_syscall_max, %rax
> #else
>         andl    $__SYSCALL_MASK, %eax
>         cmpl    $__NR_syscall_max, %eax
> #endif
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>         call    *sys_call_table(, %rax, 8)
>         movq    %rax, RAX(%rsp)
> 1:
> /*
>  * Syscall return path ending with SYSRET (fast path).
>  * Has incompletely filled pt_regs.
>  */
>         LOCKDEP_SYS_EXIT
>         /*
>          * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>          * it is too small to ever cause noticeable irq latency.
>          */
>         DISABLE_INTERRUPTS(CLBR_NONE)
>
>         /*
>          * We must check ti flags with interrupts (or at least preemption)
>          * off because we must *never* return to userspace without
>          * processing exit work that is enqueued if we're preempted here.
>          * In particular, returning to userspace with any of the one-shot
>          * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
>          * very bad.
>          */
>         testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */
>
> Most of that can be done in C.
>
> And I think we could also convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         RESTORE_EXTRA_REGS
>
>         /*
>          * Try to use SYSRET instead of IRET if we're returning to
>          * a completely clean 64-bit userspace context.
>          */
>         movq    RCX(%rsp), %rcx
>         movq    RIP(%rsp), %r11
>         cmpq    %rcx, %r11                      /* RCX == RIP */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
>          * the kernel, since userspace controls RSP.
>          *
>          * If width of "canonical tail" ever becomes variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- SYSRET checks need update"
>         .endif
>
>         /* Change top 16 bits to be the sign-extension of 47th bit */
>         shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>         sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
>         /* If this changed %rcx, it was not canonical */
>         cmpq    %rcx, %r11
>         jne     opportunistic_sysret_failed
>
>         cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         movq    R11(%rsp), %r11
>         cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>          * restoring TF results in a trap from userspace immediately after
>          * SYSRET.  This would cause an infinite loop whenever #DB happens
>          * with register state that satisfies the opportunistic SYSRET
>          * conditions.  For example, single-stepping this user code:
>          *
>          *           movq       $stuck_here, %rcx
>          *           pushfq
>          *           popq %r11
>          *   stuck_here:
>          *
>          * would never get past 'stuck_here'.
>          */
>         testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
>         jnz     opportunistic_sysret_failed
>
>         /* nothing to check for RSP */
>
>         cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * We win! This label is here just for ease of understanding
>          * perf profiles. Nothing jumps here.
>          */
> syscall_return_via_sysret:
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
> opportunistic_sysret_failed:
>         SWAPGS
>         jmp     restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true.  There are a lot more cleanups possible here.

I want to nail down the 32-bit case first so we can delete the old code.

>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls.  Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17  9:48 ` [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Ingo Molnar
  2015-06-17 10:13   ` Richard Weinberger
@ 2015-06-17 15:16   ` Andy Lutomirski
  2015-06-18 10:14     ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-17 15:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Kees Cook, Borislav Petkov, linux-kernel,
	Oleg Nesterov, Denys Vlasenko, Brian Gerst,
	Frédéric Weisbecker, X86 ML, Linus Torvalds,
	Rik van Riel

On Jun 17, 2015 2:49 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>

>
> > I tried to clean it up incrementally, but I decided it was too hard. Instead,
> > this series just replaces the code.  It seems to work.
>
> Any known bugs beyond UML build breakage?

One minor one: the 64-bit and compat syscall asm changes should be
folded together.  I was overly optimistic about bisectability -- the
intermediate step doesn't build.  I haven't tested well enough to be
really comfortable with it, though.

There's another minor error in my description: the 32-bit code is not
a prerequisite for the exception_enter removal, so v2 will remove a
whole bunch of exception_enter calls.  This considerably improves the
quality of the debugging checks.

>
> > Context tracking in particular works very differently now.  The low-level entry
> > code checks that we're in CONTEXT_USER and switches to CONTEXT_KERNEL.  The exit
> > code does the reverse.  There is no need to track what CONTEXT_XYZ state we came
> > from, because we already know.  Similarly, SCHEDULE_USER is gone, since we can
> > reschedule if needed by simply calling schedule() from C code.
> >
> > The main things that are missing are that I haven't done the 32-bit parts
> > (anyone want to help?) and therefore I haven't deleted the old C code.  I also
> > think this may break UML for trivial reasons.
> >
> > Because I haven't converted the 32-bit code yet, all of the now-unnecessary
> > calls to exception_enter are still present in traps.c.
> >
> > IRQ context tracking is still duplicated.  We should probably clean it up by
> > changing the core code to supply something like
> > irq_enter_we_are_already_in_context_kernel.
> >
> > Thoughts?
>
> So assuming you fix the UML build I'm inclined to go for it, even in this
> incomplete form, to increase testing coverage.
>
> Doing that will also decrease ongoing merge friction between your work and other
> entry code cleanups ...

Sounds good to me.  I'm not convinced this is 4.2 material, though.
Would it go in a separate branch for now?

On a related note, do you have any idea what work_notifysig_v86 in
entry_32.S is for?  It seems unnecessary and wrong to me.  Unnecessary
because we have return_to_32bit.  Wrong because I don't see how we can
reliably enter vm86 mode if we have exit work enabled -- one of the
giant turds in vm86_32.c literally jumps from C code to
resume_userspace on vm86 entry, and resume_userspace promptly checks
for work and might land in work_notifysig_v86 before we ever make it
to v8086/user mode.

I think it may actually be impossible to use vm86 under ptrace.  ISTR
I had some trouble when trying to strace my test case...

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-17  9:41   ` Ingo Molnar
  2015-06-17 14:15     ` Andy Lutomirski
@ 2015-06-17 15:27     ` Paul E. McKenney
  2015-06-18  9:59       ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2015-06-17 15:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@kernel.org> wrote:
> 
> > This will let us sprinkle sanity checks around the kernel without
> > making too much of a mess.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  include/linux/context_tracking.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 2821838256b4..0fbea4b152e1 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> >  	if (context_tracking_is_enabled())
> >  		__context_tracking_task_switch(prev, next);
> >  }
> > +
> > +static inline void context_tracking_assert_state(enum ctx_state state)
> > +{
> > +	rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > +			   this_cpu_read(context_tracking.state) == state,
> > +			   "context tracking state was wrong");
> > +}
> 
> Please don't introduce assert() style debug check interfaces!
> 
> (And RCU should be fixed too I suspect.)

The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
by analogy to WARN()?  Easy to do if so!  Or am I missing the point?

						Thanx, Paul

> They are absolutely horrible on the brain when mixed with WARN_ON() interfaces, 
> which are the dominant runtime check interface in the kernel.
> 
> Instead make it something like:
> 
>   #define ct_state() (this_cpu_read(context_tracking.state))
> 
>   #define CT_WARN_ON(cond) \
> 	WARN_ON(context_tracking_is_enabled() && (cond))
> 
> and then the debug checks can be written as:
> 
> 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> 
> This is IMHO _far_ more readable than:
> 
> 	context_tracking_assert_state(CONTEXT_KERNEL);
> 
> ok?
> 
> (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> 
> Thanks,
> 
> 	Ingo
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-17 14:15     ` Andy Lutomirski
@ 2015-06-18  9:57       ` Ingo Molnar
  2015-06-18 11:07         ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18  9:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Paul E. McKenney, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> This will let us sprinkle sanity checks around the kernel without
> >> making too much of a mess.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  include/linux/context_tracking.h | 8 ++++++++
> >>  1 file changed, 8 insertions(+)
> >>
> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> >> index 2821838256b4..0fbea4b152e1 100644
> >> --- a/include/linux/context_tracking.h
> >> +++ b/include/linux/context_tracking.h
> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> >>       if (context_tracking_is_enabled())
> >>               __context_tracking_task_switch(prev, next);
> >>  }
> >> +
> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> >> +{
> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
> >> +                        this_cpu_read(context_tracking.state) == state,
> >> +                        "context tracking state was wrong");
> >> +}
> >
> > Please don't introduce assert() style debug check interfaces!
> >
> > (And RCU should be fixed too I suspect.)
> >
> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> > which are the dominant runtime check interface in the kernel.
> >
> > Instead make it something like:
> >
> >   #define ct_state() (this_cpu_read(context_tracking.state))
> >
> >   #define CT_WARN_ON(cond) \
> >         WARN_ON(context_tracking_is_enabled() && (cond))
> >
> > and then the debug checks can be written as:
> >
> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> >
> > This is IMHO _far_ more readable than:
> >
> >         context_tracking_assert_state(CONTEXT_KERNEL);
> >
> > ok?
> >
> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> 
> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
> non-context-tracking kernels -- it seems like it's asking for trouble.

Well:

 - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.

 - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then 
    CT_WARN_ON() will compile 'cond' in, but the && short-circuit keeps it 
    from ever being executed at run time.

 - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we 
   get as far as ct_state() evaluation.

so I'm not sure I see the problem you are seeing.

> We could make CT_WARN_ON not even evaluate its argument if 
> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if 
> !context_tracking_is_enabled().

My understanding is that if !context_tracking_is_enabled() then, thanks to && 
short-circuit semantics, the rest is never evaluated at run time. This is why 
doing a NULL pointer check like this is safe:

  if (tsk && tsk->field) {
	...
  }

> The assert macro avoids all these problems despite being a bit ugly.

but writing good kernel code is all about not being ugly...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-17 15:27     ` Paul E. McKenney
@ 2015-06-18  9:59       ` Ingo Molnar
  2015-06-18 22:54         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18  9:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
> > 
> > * Andy Lutomirski <luto@kernel.org> wrote:
> > 
> > > This will let us sprinkle sanity checks around the kernel without
> > > making too much of a mess.
> > > 
> > > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > > ---
> > >  include/linux/context_tracking.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > > index 2821838256b4..0fbea4b152e1 100644
> > > --- a/include/linux/context_tracking.h
> > > +++ b/include/linux/context_tracking.h
> > > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > >  	if (context_tracking_is_enabled())
> > >  		__context_tracking_task_switch(prev, next);
> > >  }
> > > +
> > > +static inline void context_tracking_assert_state(enum ctx_state state)
> > > +{
> > > +	rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > > +			   this_cpu_read(context_tracking.state) == state,
> > > +			   "context tracking state was wrong");
> > > +}
> > 
> > Please don't introduce assert() style debug check interfaces!
> > 
> > (And RCU should be fixed too I suspect.)
> 
> The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> by analogy to WARN()?  Easy to do if so!  Or am I missing the point?

Yeah, and inverting the condition. Assuming the condition was assert()-style 
inverted to begin with! :-)

And lockdep should be fixed too, I suspect: lockdep_assert_held() was really a 
poorly chosen name, it should be 'lockdep_check_held()' or so. It has very 
little to do with the assert() interface.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 14:23   ` Andy Lutomirski
@ 2015-06-18 10:11     ` Ingo Molnar
  2015-06-18 11:06       ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 10:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Andy Lutomirski <luto@amacapital.net> wrote:

> > The only low level bits remaining in assembly will be low level hardware ABI 
> > details: saving registers and restoring registers to the expected format - no 
> > 'active' code whatsoever.
> 
> I think this is true for syscalls.  Getting the weird special cases (IRET and GS 
> fault) for error_entry to work correctly in C could be tricky.

Correct, and I double-checked the IRET fault path yesterday (fixup_bad_iret), and 
it looks like a straightforward exception handler with limited control flow. It 
can stay in asm just fine, it seems mostly orthogonal to the rest.

I didn't check the GS fault path, but that only affects 32-bit, as we use SWAPGS 
on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS) belongs in 
the natural 'hardware ABI preparation code' that should stay in assembly. (Unless 
I missed some other code that might cause trouble.)

The most deadly complexity in our asm code is IMHO the intertwined threads of 
control flow - all of that should go into C, where it's much easier to see what's 
going on, because we have named variables, established code patterns and a 
compiler checking for common mistakes and such.

The other big area of complexity is our partial save/restore tricks, which make 
tracking of what is saved (and what is not) tricky and fragile.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-17 15:16   ` Andy Lutomirski
@ 2015-06-18 10:14     ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 10:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Kees Cook, Borislav Petkov, linux-kernel,
	Oleg Nesterov, Denys Vlasenko, Brian Gerst,
	Frédéric Weisbecker, X86 ML, Linus Torvalds,
	Rik van Riel


* Andy Lutomirski <luto@amacapital.net> wrote:

> > So assuming you fix the UML build I'm inclined to go for it, even in this 
> > incomplete form, to increase testing coverage.
> >
> > Doing that will also decrease ongoing merge friction between your work and 
> > other entry code cleanups ...
> 
> Sounds good to me.  I'm not convinced this is 4.2 material, though. Would it go 
> in a separate branch for now?

Please send it out as a separate series, on top of -tip, I'll probably stick it 
into a separate branch for the time being.

> On a related note, do you have any idea what work_notifysig_v86 in entry_32.S is 
> for?  It seems unnecessary and wrong to me.  Unnecessary because we have 
> return_to_32bit.  Wrong because I don't see how we can reliably enter vm86 mode 
> if we have exit work enabled -- one of the giant turds in vm86_32.c literally 
> jumps from C code to resume_userspace on vm86 entry, and resume_userspace 
> promptly checks for work and might land in work_notifysig_v86 before we ever 
> make it to v8086/user mode.
> 
> I think it may actually be impossible to use vm86 under ptrace.  ISTR I had some 
> trouble when trying to strace my test case...

Should be tested really, I'm all for removing it if simple vm86 mode games 
continue to work ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  2015-06-17 14:12       ` Andy Lutomirski
@ 2015-06-18 10:17         ` Ingo Molnar
  2015-06-18 10:19           ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 10:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> >> Any reason why irq state tracking cannot be done in C as well, like the rest 
> >> of the irq state tracking code?
> >
> > Never mind, I see you've done exactly that in patch #12.
> 
> There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such scattered 
> throughout the asm.  It's plausible that even more of that could be moved to C.
> 
> We could also benchmark and find out how bad it would be if we just always 
> filled in pt_regs completely in syscalls.  If the performance hit isn't enough 
> to matter, then we could potentially move the entire syscall path except pt_regs 
> setup and sysret/iret into three C functions.

The thing is, I'd not be against simplifying pt_regs handling even if it slows 
things down a tiny bit. If anyone wants to reintroduce that complexity, we'll see 
how it looks in isolation, done cleanly.
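
To make the "three C functions" idea concrete, a sketch (all names invented 
here; the asm stub would do nothing but save pt_regs, call these, and 
SYSRET/IRET):

	/* 1: entry work - context tracking, ptrace/seccomp/audit hooks */
	void syscall_entry_work_sketch(struct pt_regs *regs);

	/* 2: the dispatch itself, with full pt_regs always available */
	void syscall_dispatch_sketch(struct pt_regs *regs)
	{
		if (regs->orig_ax < NR_syscalls)
			regs->ax = sys_call_table[regs->orig_ax](
					regs->di, regs->si, regs->dx,
					regs->r10, regs->r8, regs->r9);
		else
			regs->ax = -ENOSYS;
	}

	/* 3: exit work - run the work loop, pick SYSRET vs IRET */
	void syscall_exit_work_sketch(struct pt_regs *regs);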

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks
  2015-06-18 10:17         ` Ingo Molnar
@ 2015-06-18 10:19           ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 10:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst,
	Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > >> Any reason why irq state tracking cannot be done in C as well, like the 
> > >> rest of the irq state tracking code?
> > >
> > > Never mind, I see you've done exactly that in patch #12.
> > 
> > There are still some TRACE_IRQS_ON, LOCKDEP_SYS_EXIT, and such scattered 
> > throughout the asm.  It's plausible that even more of that could be moved to 
> > C.
> > 
> > We could also benchmark and find out how bad it would be if we just always 
> > filled in pt_regs completely in syscalls.  If the performance hit isn't enough 
> > to matter, then we could potentially move the entire syscall path except 
> > pt_regs setup and sysret/iret into three C functions.
> 
> The thing is, I'd not be against simplifying pt_regs handling even if it slows 
> things down a tiny bit. If anyone wants to reintroduce that complexity, we'll see 
> how it looks in isolation, done cleanly.

... and I suspect the reduction of entry points will allow the compiler to do a 
better job - so some of the overhead might be won back.

So I'd say we try this approach, and complicate it again in the future only if 
the numbers warrant it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-18 10:11     ` Ingo Molnar
@ 2015-06-18 11:06       ` Andy Lutomirski
  2015-06-18 16:24         ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-18 11:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 3:11 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > The only low level bits remaining in assembly will be low level hardware ABI
>> > details: saving registers and restoring registers to the expected format - no
>> > 'active' code whatsoever.
>>
>> I think this is true for syscalls.  Getting the weird special cases (IRET and GS
>> fault) for error_entry to work correctly in C could be tricky.
>
> Correct, and I double checked the IRET fault path yesterday (fixup_bad_iret), and
> it looks like a straightforward exception handler with limited control flow. It
> can stay in asm just fine, it seems mostly orthogonal to the rest.
>
> I didn't check the GS fault path, but that only affects 32-bit, as we use SWAPGS
> on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS) belongs into
> the natural 'hardware ABI preparation code' that should stay in assembly. (Unless
> I missed some other code that might cause trouble.)

Look for "gs_change".  To change the gs selector, we do swapgs, then
load gs, then swapgs again.  If the gs load fails, then we trigger a
special fixup.
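
Very roughly - a sketch only, not the kernel's actual native_load_gs_index() 
(the real thing, I believe, also disables interrupts around the dance and 
zeroes %gs in its fixup path):

	/* needs <asm/asm.h> for _ASM_EXTABLE */
	static inline void load_gs_selector_sketch(unsigned int sel)
	{
		asm volatile("swapgs\n\t"		/* expose the user GS base */
			     "1: movl %0, %%gs\n\t"	/* can fault on a bad selector */
			     "2: swapgs\n\t"		/* back to the kernel GS base */
			     _ASM_EXTABLE(1b, 2b)	/* on a fault, resume at 2: */
			     : : "r" (sel) : "memory");
	}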

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18  9:57       ` Ingo Molnar
@ 2015-06-18 11:07         ` Andy Lutomirski
  2015-06-18 15:52           ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-18 11:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Paul E. McKenney, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> >> This will let us sprinkle sanity checks around the kernel without
>> >> making too much of a mess.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> >> ---
>> >>  include/linux/context_tracking.h | 8 ++++++++
>> >>  1 file changed, 8 insertions(+)
>> >>
>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> >> index 2821838256b4..0fbea4b152e1 100644
>> >> --- a/include/linux/context_tracking.h
>> >> +++ b/include/linux/context_tracking.h
>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>> >>       if (context_tracking_is_enabled())
>> >>               __context_tracking_task_switch(prev, next);
>> >>  }
>> >> +
>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>> >> +{
>> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> >> +                        this_cpu_read(context_tracking.state) == state,
>> >> +                        "context tracking state was wrong");
>> >> +}
>> >
>> > Please don't introduce assert() style debug check interfaces!
>> >
>> > (And RCU should be fixed too I suspect.)
>> >
>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>> > which are the dominant runtime check interface in the kernel.
>> >
>> > Instead make it something like:
>> >
>> >   #define ct_state() (this_cpu_read(context_tracking.state))
>> >
>> >   #define CT_WARN_ON(cond) \
>> >         WARN_ON(context_tracking_is_enabled() && (cond))
>> >
>> > and then the debug checks can be written as:
>> >
>> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> >
>> > This is IMHO _far_ more readable than:
>> >
>> >         context_tracking_assert_state(CONTEXT_KERNEL);
>> >
>> > ok?
>> >
>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>>
>> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
>> non-context-tracking kernels -- it seems like it's asking for trouble.
>
> Well:
>
>  - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>
>  - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
>    CT_WARN_ON() will compile 'cond', but won't evaluate it at run time.
>
>  - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
>    get as far as ct_state() evaluation.
>
> so I'm not sure I see the problem you are seeing.
>
>> We could make CT_WARN_ON not even evaluate its argument if
>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>> !context_tracking_is_enabled().
>
> My understanding is that if !context_tracking_is_enabled() then the compiler
> should not even try to evaluate the rest. This is why doing a NULL pointer check
> like this is safe:

I'm fine with everything you just covered.  My only objection is that,
if ct_state() exists, then someone might call it outside CT_WARN_ON,
in which case it will break on non-context-tracking setups.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 11:07         ` Andy Lutomirski
@ 2015-06-18 15:52           ` Andy Lutomirski
  2015-06-18 16:17             ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-18 15:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Paul E. McKenney, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> >
>>> > * Andy Lutomirski <luto@kernel.org> wrote:
>>> >
>>> >> This will let us sprinkle sanity checks around the kernel without
>>> >> making too much of a mess.
>>> >>
>>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> >> ---
>>> >>  include/linux/context_tracking.h | 8 ++++++++
>>> >>  1 file changed, 8 insertions(+)
>>> >>
>>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>>> >> index 2821838256b4..0fbea4b152e1 100644
>>> >> --- a/include/linux/context_tracking.h
>>> >> +++ b/include/linux/context_tracking.h
>>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>>> >>       if (context_tracking_is_enabled())
>>> >>               __context_tracking_task_switch(prev, next);
>>> >>  }
>>> >> +
>>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>>> >> +{
>>> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
>>> >> +                        this_cpu_read(context_tracking.state) == state,
>>> >> +                        "context tracking state was wrong");
>>> >> +}
>>> >
>>> > Please don't introduce assert() style debug check interfaces!
>>> >
>>> > (And RCU should be fixed too I suspect.)
>>> >
>>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>>> > which are the dominant runtime check interface in the kernel.
>>> >
>>> > Instead make it something like:
>>> >
>>> >   #define ct_state() (this_cpu_read(context_tracking.state))
>>> >
>>> >   #define CT_WARN_ON(cond) \
>>> >         WARN_ON(context_tracking_is_enabled() && (cond))
>>> >
>>> > and then the debug checks can be written as:
>>> >
>>> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>>> >
>>> > This is IMHO _far_ more readable than:
>>> >
>>> >         context_tracking_assert_state(CONTEXT_KERNEL);
>>> >
>>> > ok?
>>> >
>>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>>>
>>> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
>>> non-context-tracking kernels -- it seems like it's asking for trouble.
>>
>> Well:
>>
>>  - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>>
>>  - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
>>    CT_WARN_ON() will compile 'cond', but won't evaluate it at run time.
>>
>>  - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
>>    get as far as ct_state() evaluation.
>>
>> so I'm not sure I see the problem you are seeing.
>>
>>> We could make CT_WARN_ON not even evaluate its argument if
>>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>>> !context_tracking_is_enabled().
>>
>> My understanding is that if !context_tracking_is_enabled() then the compiler
>> should not even try to evaluate the rest. This is why doing a NULL pointer check
>> like this is safe:
>
> I'm fine with everything you just covered.  My only objection is that,
> if ct_state() exists, then someone might call it outside CT_WARN_ON,
> in which case it will break on non-context-tracking setups.

The more I think about it, the more I dislike ct_state().  We have
in_atomic(), which is already problematic because the return value
isn't reliable.  ct_state(), if callable on non context-tracking
kernels, will also be unreliable.  I prefer things like
lockdep_assert_held because they can't be misused.

It would be far too easy for someone to read:

CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

and add:

if (ct_state() == CONTEXT_KERNEL)
  do_something();

and that would be bad.

--Andy

>
> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 15:52           ` Andy Lutomirski
@ 2015-06-18 16:17             ` Ingo Molnar
  2015-06-18 16:26               ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 16:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Paul E. McKenney, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >>
> >> * Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >>> >
> >>> > * Andy Lutomirski <luto@kernel.org> wrote:
> >>> >
> >>> >> This will let us sprinkle sanity checks around the kernel without
> >>> >> making too much of a mess.
> >>> >>
> >>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >>> >> ---
> >>> >>  include/linux/context_tracking.h | 8 ++++++++
> >>> >>  1 file changed, 8 insertions(+)
> >>> >>
> >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> >>> >> index 2821838256b4..0fbea4b152e1 100644
> >>> >> --- a/include/linux/context_tracking.h
> >>> >> +++ b/include/linux/context_tracking.h
> >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> >>> >>       if (context_tracking_is_enabled())
> >>> >>               __context_tracking_task_switch(prev, next);
> >>> >>  }
> >>> >> +
> >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> >>> >> +{
> >>> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
> >>> >> +                        this_cpu_read(context_tracking.state) == state,
> >>> >> +                        "context tracking state was wrong");
> >>> >> +}
> >>> >
> >>> > Please don't introduce assert() style debug check interfaces!
> >>> >
> >>> > (And RCU should be fixed too I suspect.)
> >>> >
> >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> >>> > which are the dominant runtime check interface in the kernel.
> >>> >
> >>> > Instead make it something like:
> >>> >
> >>> >   #define ct_state() (this_cpu_read(context_tracking.state))
> >>> >
> >>> >   #define CT_WARN_ON(cond) \
> >>> >         WARN_ON(context_tracking_is_enabled() && (cond))
> >>> >
> >>> > and then the debug checks can be written as:
> >>> >
> >>> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> >>> >
> >>> > This is IMHO _far_ more readable than:
> >>> >
> >>> >         context_tracking_assert_state(CONTEXT_KERNEL);
> >>> >
> >>> > ok?
> >>> >
> >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> >>>
> >>> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
> >>> non-context-tracking kernels -- it seems like it's asking for trouble.
> >>
> >> Well:
> >>
> >>  - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
> >>
> >>  - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
> >>    CT_WARN_ON() will compile 'cond', but won't evaluate it at run time.
> >>
> >>  - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
> >>    get as far as ct_state() evaluation.
> >>
> >> so I'm not sure I see the problem you are seeing.
> >>
> >>> We could make CT_WARN_ON not even evaluate its argument if
> >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
> >>> !context_tracking_is_enabled().
> >>
> >> My understanding is that if !context_tracking_is_enabled() then the compiler
> >> should not even try to evaluate the rest. This is why doing a NULL pointer check
> >> like this is safe:
> >
> > I'm fine with everything you just covered.  My only objection is that,
> > if ct_state() exists, then someone might call it outside CT_WARN_ON,
> > in which case it will break on non-context-tracking setups.
> 
> The more I think about it, the more I dislike ct_state().  We have
> in_atomic(), which is already problematic because the return value
> isn't reliable.  ct_state(), if callable on non context-tracking
> kernels, will also be unreliable.  I prefer things like
> lockdep_assert_held because they can't be misused.
> 
> It would be far too easy for someone to read:
> 
> CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> 
> and add:
> 
> if (ct_state() == CONTEXT_KERNEL)
>   do_something();
> 
> and that would be bad.

But ct_state() could be made reliable: if !context_tracking_is_enabled() then it 
should return -1 or so.

I.e. we could make it something like:

        enum ctx_state {
                CONTEXT_DISABLED	= -1,
                CONTEXT_KERNEL		=  0,
                CONTEXT_USER		=  1,
                CONTEXT_GUEST		=  2,
        } state;

static inline enum ctx_state ct_state(void)
{
	if (context_tracking_is_enabled())
	return this_cpu_read(context_tracking.state);

	return CONTEXT_DISABLED;
}

and then CT_WARN_ON() DTRT.
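
I.e. the whole thing would compose like this (sketch only):

	#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))

	/* the debug check, as before: */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

	/*
	 * ... and the misuse you worried about now merely reads as false
	 * (instead of garbage) when tracking is off:
	 */
	if (ct_state() == CONTEXT_KERNEL)
		do_something();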

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
  2015-06-18 11:06       ` Andy Lutomirski
@ 2015-06-18 16:24         ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-06-18 16:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Jun 18, 2015 at 3:11 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> > The only low level bits remaining in assembly will be low level hardware ABI
> >> > details: saving registers and restoring registers to the expected format - no
> >> > 'active' code whatsoever.
> >>
> >> I think this is true for syscalls.  Getting the weird special cases (IRET and 
> >> GS fault) for error_entry to work correctly in C could be tricky.
> >
> > Correct, and I double checked the IRET fault path yesterday (fixup_bad_iret), 
> > and it looks like a straightforward exception handler with limited control 
> > flow. It can stay in asm just fine, it seems mostly orthogonal to the rest.
> >
> > I didn't check the GS fault path, but that only affects 32-bit, as we use 
> > SWAPGS on 64-bit, right? In any case, that code too (32-bit RESTORE_REGS) 
> > belongs into the natural 'hardware ABI preparation code' that should stay in 
> > assembly. (Unless I missed some other code that might cause trouble.)
> 
> Look for "gs_change".  To change the gs selector, we do swapgs, then load gs, 
> then swapgs again.  If the gs load fails, then we trigger a special fixup.

Yes, but I don't see the connection to moving the syscall (and IRQ) entry code to 
.c: native_load_gs_index() is a separate API we call from regular kernel code, and 
it has a regular exception fixup entry plus a trap handler special case.

It's fine in entry_64.S, but it would be equally fine in inline asm() as well.

I think it's fine in entry_64.S as long as the error trap code (which refers to 
the change_gs RIP) lives there. But it could live in .c as well, as we can 
generate global symbols within inline asm() too.
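
E.g. something like this (illustrative only) emits a globally visible label 
straight from a .c file, which the trap code could then refer to:

	/* toplevel inline asm in a .c file, emitting a global text symbol */
	asm(".pushsection .text\n"
	    ".globl gs_change_sketch\n"
	    "gs_change_sketch:\n"
	    "	ret\n"
	    ".popsection\n");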

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 16:17             ` Ingo Molnar
@ 2015-06-18 16:26               ` Frederic Weisbecker
  2015-06-18 19:26                 ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2015-06-18 16:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Andy Lutomirski, Paul E. McKenney, X86 ML,
	linux-kernel, Rik van Riel, Oleg Nesterov, Denys Vlasenko,
	Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 06:17:29PM +0200, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >>
> > >> * Andy Lutomirski <luto@amacapital.net> wrote:
> > >>
> > >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >>> >
> > >>> > * Andy Lutomirski <luto@kernel.org> wrote:
> > >>> >
> > >>> >> This will let us sprinkle sanity checks around the kernel without
> > >>> >> making too much of a mess.
> > >>> >>
> > >>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > >>> >> ---
> > >>> >>  include/linux/context_tracking.h | 8 ++++++++
> > >>> >>  1 file changed, 8 insertions(+)
> > >>> >>
> > >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > >>> >> index 2821838256b4..0fbea4b152e1 100644
> > >>> >> --- a/include/linux/context_tracking.h
> > >>> >> +++ b/include/linux/context_tracking.h
> > >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > >>> >>       if (context_tracking_is_enabled())
> > >>> >>               __context_tracking_task_switch(prev, next);
> > >>> >>  }
> > >>> >> +
> > >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
> > >>> >> +{
> > >>> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > >>> >> +                        this_cpu_read(context_tracking.state) == state,
> > >>> >> +                        "context tracking state was wrong");
> > >>> >> +}
> > >>> >
> > >>> > Please don't introduce assert() style debug check interfaces!
> > >>> >
> > >>> > (And RCU should be fixed too I suspect.)
> > >>> >
> > >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
> > >>> > which are the dominant runtime check interface in the kernel.
> > >>> >
> > >>> > Instead make it something like:
> > >>> >
> > >>> >   #define ct_state() (this_cpu_read(context_tracking.state))
> > >>> >
> > >>> >   #define CT_WARN_ON(cond) \
> > >>> >         WARN_ON(context_tracking_is_enabled() && (cond))
> > >>> >
> > >>> > and then the debug checks can be written as:
> > >>> >
> > >>> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> > >>> >
> > >>> > This is IMHO _far_ more readable than:
> > >>> >
> > >>> >         context_tracking_assert_state(CONTEXT_KERNEL);
> > >>> >
> > >>> > ok?
> > >>> >
> > >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
> > >>>
> > >>> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
> > >>> non-context-tracking kernels -- it seems like it's asking for trouble.
> > >>
> > >> Well:
> > >>
> > >>  - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
> > >>
> > >>  - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
> > >>    CT_WARN_ON() will compile 'cond', but won't evaluate it at run time.
> > >>
> > >>  - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
> > >>    get as far as ct_state() evaluation.
> > >>
> > >> so I'm not sure I see the problem you are seeing.
> > >>
> > >>> We could make CT_WARN_ON not even evaluate its argument if
> > >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
> > >>> !context_tracking_is_enabled().
> > >>
> > >> My understanding is that if !context_tracking_is_enabled() then the compiler
> > >> should not even try to evaluate the rest. This is why doing a NULL pointer check
> > >> like this is safe:
> > >
> > > I'm fine with everything you just covered.  My only objection is that,
> > > if ct_state() exists, then someone might call it outside CT_WARN_ON,
> > > in which case it will break on non-context-tracking setups.
> > 
> > The more I think about it, the more I dislike ct_state().  We have
> > in_atomic(), which is already problematic because the return value
> > isn't reliable.  ct_state(), if callable on non context-tracking
> > kernels, will also be unreliable.  I prefer things like
> > lockdep_assert_held because they can't be misused.
> > 
> > It would be far too easy for someone to read:
> > 
> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> > 
> > and add:
> > 
> > if (ct_state() == CONTEXT_KERNEL)
> >   do_something();
> > 
> > and that would be bad.
> 
> But ct_state() could be made reliable: if !context_tracking_is_enabled() then it 
> should return -1 or so.
> 
> I.e. we could make it something like:
> 
>         enum ctx_state {
>                 CONTEXT_DISABLED	= -1,
>                 CONTEXT_KERNEL		=  0,
>                 CONTEXT_USER		=  1,
>                 CONTEXT_GUEST		=  2,
>         } state;
> 
> static inline enum ctx_state ct_state(void)
> {
> 	if (context_tracking_is_enabled())
> 		return this_cpu_read(context_tracking.state))
> 
> 	return CONTEXT_DISABLED;
> }
> 
> and then CT_WARN_ON() DTRT.

That sounds good. With good layout of these things, the compiler should still be
able to nop related code when context tracking is disabled.
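
(For reference, context_tracking_is_enabled() is - if I remember the current 
code correctly - a static key wrapper along these lines, so the disabled case 
is a patched-out branch:

	static inline bool context_tracking_is_enabled(void)
	{
		return static_key_false(&context_tracking_enabled);
	}

and everything dominated by it can be dropped by the compiler.)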

> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 16:26               ` Frederic Weisbecker
@ 2015-06-18 19:26                 ` Andy Lutomirski
  0 siblings, 0 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-06-18 19:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Andy Lutomirski, Paul E. McKenney, X86 ML,
	linux-kernel, Rik van Riel, Oleg Nesterov, Denys Vlasenko,
	Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 9:26 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Thu, Jun 18, 2015 at 06:17:29PM +0200, Ingo Molnar wrote:
>>
>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> > On Thu, Jun 18, 2015 at 4:07 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > > On Thu, Jun 18, 2015 at 2:57 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > >>
>> > >> * Andy Lutomirski <luto@amacapital.net> wrote:
>> > >>
>> > >>> On Wed, Jun 17, 2015 at 2:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > >>> >
>> > >>> > * Andy Lutomirski <luto@kernel.org> wrote:
>> > >>> >
>> > >>> >> This will let us sprinkle sanity checks around the kernel without
>> > >>> >> making too much of a mess.
>> > >>> >>
>> > >>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> > >>> >> ---
>> > >>> >>  include/linux/context_tracking.h | 8 ++++++++
>> > >>> >>  1 file changed, 8 insertions(+)
>> > >>> >>
>> > >>> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> > >>> >> index 2821838256b4..0fbea4b152e1 100644
>> > >>> >> --- a/include/linux/context_tracking.h
>> > >>> >> +++ b/include/linux/context_tracking.h
>> > >>> >> @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
>> > >>> >>       if (context_tracking_is_enabled())
>> > >>> >>               __context_tracking_task_switch(prev, next);
>> > >>> >>  }
>> > >>> >> +
>> > >>> >> +static inline void context_tracking_assert_state(enum ctx_state state)
>> > >>> >> +{
>> > >>> >> +     rcu_lockdep_assert(!context_tracking_is_enabled() ||
>> > >>> >> +                        this_cpu_read(context_tracking.state) == state,
>> > >>> >> +                        "context tracking state was wrong");
>> > >>> >> +}
>> > >>> >
>> > >>> > Please don't introduce assert() style debug check interfaces!
>> > >>> >
>> > >>> > (And RCU should be fixed too I suspect.)
>> > >>> >
>> > >>> > They are absolutely horrible on the brain when mixed with WARN_ON() interfaces,
>> > >>> > which are the dominant runtime check interface in the kernel.
>> > >>> >
>> > >>> > Instead make it something like:
>> > >>> >
>> > >>> >   #define ct_state() (this_cpu_read(context_tracking.state))
>> > >>> >
>> > >>> >   #define CT_WARN_ON(cond) \
>> > >>> >         WARN_ON(context_tracking_is_enabled() && (cond))
>> > >>> >
>> > >>> > and then the debug checks can be written as:
>> > >>> >
>> > >>> >         CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> > >>> >
>> > >>> > This is IMHO _far_ more readable than:
>> > >>> >
>> > >>> >         context_tracking_assert_state(CONTEXT_KERNEL);
>> > >>> >
>> > >>> > ok?
>> > >>> >
>> > >>> > (Assuming people will accept 'ct/CT' as an abbreviation for context tracking.)
>> > >>>
>> > >>> Hmm, ok I guess.  The part I don't like is having ct_state() at all on
>> > >>> non-context-tracking kernels -- it seems like it's asking for trouble.
>> > >>
>> > >> Well:
>> > >>
>> > >>  - if # CONFIG_CONTEXT_TRACKING is not set, then CT_WARN_ON() does nothing.
>> > >>
>> > >>  - if CONFIG_CONTEXT_TRACKING=y, but !context_tracking_is_enabled(), then
>> > >>    CT_WARN_ON() will compile 'cond', but won't evaluate it at run time.
>> > >>
>> > >>  - only if CONFIG_CONTEXT_TRACKING=y && context_tracking_is_enabled() should we
>> > >>    get as far as ct_state() evaluation.
>> > >>
>> > >> so I'm not sure I see the problem you are seeing.
>> > >>
>> > >>> We could make CT_WARN_ON not even evaluate its argument if
>> > >>> !CONFIG_CONTEXT_TRACKING, but then we still have ct_state() returning garbage if
>> > >>> !context_tracking_is_enabled().
>> > >>
>> > >> My understanding is that if !context_tracking_is_enabled() then the compiler
>> > >> should not even try to evaluate the rest. This is why doing a NULL pointer check
>> > >> like this is safe:
>> > >
>> > > I'm fine with everything you just covered.  My only objection is that,
>> > > if ct_state() exists, then someone might call it outside CT_WARN_ON,
>> > > in which case it will break on non-context-tracking setups.
>> >
>> > The more I think about it, the more I dislike ct_state().  We have
>> > in_atomic(), which is already problematic because the return value
>> > isn't reliable.  ct_state(), if callable on non context-tracking
>> > kernels, will also be unreliable.  I prefer things like
>> > lockdep_assert_held because they can't be misused.
>> >
>> > It would be far too easy for someone to read:
>> >
>> > CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> >
>> > and add:
>> >
>> > if (ct_state() == CONTEXT_KERNEL)
>> >   do_something();
>> >
>> > and that would be bad.
>>
>> But ct_state() could be made reliable: if !context_tracking_is_enabled() then it
>> should return -1 or so.
>>
>> I.e. we could make it something like:
>>
>>         enum ctx_state {
>>                 CONTEXT_DISABLED = -1,
>>                 CONTEXT_KERNEL   =  0,
>>                 CONTEXT_USER     =  1,
>>                 CONTEXT_GUEST    =  2,
>>         } state;
>>
>> static inline enum ctx_state ct_state(void)
>> {
>>       if (context_tracking_is_enabled())
>>               return this_cpu_read(context_tracking.state);
>>
>>       return CONTEXT_DISABLED;
>> }
>>
>> and then CT_WARN_ON() DTRT.
>
> That sounds good. With good layout of these things, the compiler should still be
> able to nop related code when context tracking is disabled.

Done.

--Andy

>
>>
>> Thanks,
>>
>>       Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18  9:59       ` Ingo Molnar
@ 2015-06-18 22:54         ` Paul E. McKenney
  2015-06-19  2:19           ` Paul E. McKenney
  2015-06-30 11:04           ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2015-06-18 22:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 11:59:55AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 11:41:14AM +0200, Ingo Molnar wrote:
> > > 
> > > * Andy Lutomirski <luto@kernel.org> wrote:
> > > 
> > > > This will let us sprinkle sanity checks around the kernel without
> > > > making too much of a mess.
> > > > 
> > > > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > > > ---
> > > >  include/linux/context_tracking.h | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > > > index 2821838256b4..0fbea4b152e1 100644
> > > > --- a/include/linux/context_tracking.h
> > > > +++ b/include/linux/context_tracking.h
> > > > @@ -57,6 +57,13 @@ static inline void context_tracking_task_switch(struct task_struct *prev,
> > > >  	if (context_tracking_is_enabled())
> > > >  		__context_tracking_task_switch(prev, next);
> > > >  }
> > > > +
> > > > +static inline void context_tracking_assert_state(enum ctx_state state)
> > > > +{
> > > > +	rcu_lockdep_assert(!context_tracking_is_enabled() ||
> > > > +			   this_cpu_read(context_tracking.state) == state,
> > > > +			   "context tracking state was wrong");
> > > > +}
> > > 
> > > Please don't introduce assert() style debug check interfaces!
> > > 
> > > (And RCU should be fixed too I suspect.)
> > 
> > The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> > by analogy to WARN()?  Easy to do if so!  Or am I missing the point?
> 
> Yeah, and inverting the condition. Assuming the condition was assert()-style 
> inverted to begin with! :-)

It appears to have been.  ;-)

Please see below for an untested patch.  I made this one big patch,
but could have one patch add RCU_LOCKDEP_WARN(), a series convert uses
from rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch
remove rcu_lockdep_assert().  Let me know!

> and lockdep should be fixed too, I suspect: lockdep_assert_held() was really a 
> poorly chosen name; it should be 'lockdep_check_held()' or so? It has very 
> little to do with the assert() interface.
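
The conversion is mechanical: each assert-style condition (warn when it does 
not hold) becomes a WARN_ON()-style condition (warn when it does hold), e.g.:

	/* before: assert-style, fires when the condition does NOT hold */
	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");

	/* after: WARN_ON()-style, fires when the condition DOES hold */
	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");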

------------------------------------------------------------------------

commit a3053da26e914ab9d7fe25b03d35bbe5a2a718c0
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu Jun 18 15:50:02 2015 -0700

    rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
    
    This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
    consistency with the WARN() series of macros.  This also requires
    inverting the sense of the conditional, which this commit also does.
    
    Reported-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab5247687..37f827cb78ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -136,7 +136,7 @@ enum ctx_state ist_enter(struct pt_regs *regs)
 	preempt_count_add(HARDIRQ_OFFSET);
 
 	/* This code is a bit fragile.  Test it. */
-	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
 
 	return prev_state;
 }
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 677fb2843553..3b188f20b43f 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -110,8 +110,8 @@ static DEFINE_MUTEX(dev_opp_list_lock);
 
 #define opp_rcu_lockdep_assert()					\
 do {									\
-	rcu_lockdep_assert(rcu_read_lock_held() ||			\
-				lockdep_is_held(&dev_opp_list_lock),	\
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
+				!lockdep_is_held(&dev_opp_list_lock),	\
 			   "Missing rcu_read_lock() or "		\
 			   "dev_opp_list_lock protection");		\
 } while (0)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 230f87bdf5ad..910c75f6cd0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,8 +83,8 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i
 
 static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
 {
-	rcu_lockdep_assert(rcu_read_lock_held() ||
-			   lockdep_is_held(&files->file_lock),
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+			   !lockdep_is_held(&files->file_lock),
 			   "suspicious rcu_dereference_check() usage");
 	return __fcheck_files(files, fd);
 }
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2f4a93a8ceab..e9e04c32cb4c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -537,14 +537,14 @@ static inline int rcu_read_lock_sched_held(void)
 #ifdef CONFIG_PROVE_RCU
 
 /**
- * rcu_lockdep_assert - emit lockdep splat if specified condition not met
+ * RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
  * @c: condition to check
  * @s: informative message
  */
-#define rcu_lockdep_assert(c, s)					\
+#define RCU_LOCKDEP_WARN(c, s)						\
 	do {								\
 		static bool __section(.data.unlikely) __warned;		\
-		if (debug_lockdep_rcu_enabled() && !__warned && !(c)) {	\
+		if (debug_lockdep_rcu_enabled() && !__warned && (c)) {	\
 			__warned = true;				\
 			lockdep_rcu_suspicious(__FILE__, __LINE__, s);	\
 		}							\
@@ -553,8 +553,8 @@ static inline int rcu_read_lock_sched_held(void)
 #if defined(CONFIG_PROVE_RCU) && !defined(CONFIG_PREEMPT_RCU)
 static inline void rcu_preempt_sleep_check(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
-			   "Illegal context switch in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+			 "Illegal context switch in RCU read-side critical section");
 }
 #else /* #ifdef CONFIG_PROVE_RCU */
 static inline void rcu_preempt_sleep_check(void)
@@ -565,15 +565,15 @@ static inline void rcu_preempt_sleep_check(void)
 #define rcu_sleep_check()						\
 	do {								\
 		rcu_preempt_sleep_check();				\
-		rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),	\
-				   "Illegal context switch in RCU-bh read-side critical section"); \
-		rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),	\
-				   "Illegal context switch in RCU-sched read-side critical section"); \
+		RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),	\
+				 "Illegal context switch in RCU-bh read-side critical section"); \
+		RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),	\
+				 "Illegal context switch in RCU-sched read-side critical section"); \
 	} while (0)
 
 #else /* #ifdef CONFIG_PROVE_RCU */
 
-#define rcu_lockdep_assert(c, s) do { } while (0)
+#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
 #define rcu_sleep_check() do { } while (0)
 
 #endif /* #else #ifdef CONFIG_PROVE_RCU */
@@ -604,13 +604,13 @@ static inline void rcu_preempt_sleep_check(void)
 ({ \
 	/* Dependency order vs. p above. */ \
 	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
-	rcu_lockdep_assert(c, "suspicious rcu_dereference_check() usage"); \
+	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(________p1)); \
 })
 #define __rcu_dereference_protected(p, c, space) \
 ({ \
-	rcu_lockdep_assert(c, "suspicious rcu_dereference_protected() usage"); \
+	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_protected() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(p)); \
 })
@@ -849,8 +849,8 @@ static inline void rcu_read_lock(void)
 	__rcu_read_lock();
 	__acquire(RCU);
 	rcu_lock_acquire(&rcu_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock() used illegally while idle");
 }
 
 /*
@@ -900,8 +900,8 @@ static inline void rcu_read_lock(void)
  */
 static inline void rcu_read_unlock(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock() used illegally while idle");
 	__release(RCU);
 	__rcu_read_unlock();
 	rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
@@ -929,8 +929,8 @@ static inline void rcu_read_lock_bh(void)
 	local_bh_disable();
 	__acquire(RCU_BH);
 	rcu_lock_acquire(&rcu_bh_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock_bh() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock_bh() used illegally while idle");
 }
 
 /*
@@ -940,8 +940,8 @@ static inline void rcu_read_lock_bh(void)
  */
 static inline void rcu_read_unlock_bh(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock_bh() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock_bh() used illegally while idle");
 	rcu_lock_release(&rcu_bh_lock_map);
 	__release(RCU_BH);
 	local_bh_enable();
@@ -965,8 +965,8 @@ static inline void rcu_read_lock_sched(void)
 	preempt_disable();
 	__acquire(RCU_SCHED);
 	rcu_lock_acquire(&rcu_sched_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock_sched() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock_sched() used illegally while idle");
 }
 
 /* Used by lockdep and tracing: cannot be traced, cannot call lockdep. */
@@ -983,8 +983,8 @@ static inline notrace void rcu_read_lock_sched_notrace(void)
  */
 static inline void rcu_read_unlock_sched(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock_sched() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock_sched() used illegally while idle");
 	rcu_lock_release(&rcu_sched_lock_map);
 	__release(RCU_SCHED);
 	preempt_enable();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..ad2d0162532a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -104,8 +104,8 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
 static DEFINE_SPINLOCK(release_agent_path_lock);
 
 #define cgroup_assert_mutex_or_rcu_locked()				\
-	rcu_lockdep_assert(rcu_read_lock_held() ||			\
-			   lockdep_is_held(&cgroup_mutex),		\
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
+			   !lockdep_is_held(&cgroup_mutex),		\
 			   "cgroup_mutex or RCU read lock required");
 
 /*
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fd07d5b7baf..ca368793808e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -451,9 +451,8 @@ EXPORT_SYMBOL(pid_task);
  */
 struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 {
-	rcu_lockdep_assert(rcu_read_lock_held(),
-			   "find_task_by_pid_ns() needs rcu_read_lock()"
-			   " protection");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+			 "find_task_by_pid_ns() needs rcu_read_lock() protection");
 	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
 }
 
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index de35087c92a5..d3fcb2ec8536 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -415,11 +415,11 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
 	struct rcu_head *head = &rcu.head;
 	bool done = false;
 
-	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
-			   !lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&sp->dep_map) ||
+			 lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");
 
 	might_sleep();
 	init_completion(&rcu.completion);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 591af0cb7b9f..e888602d5c56 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -191,10 +191,10 @@ static void rcu_process_callbacks(struct softirq_action *unused)
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_sched() in RCU read-side critical section");
 	cond_resched();
 }
 EXPORT_SYMBOL_GPL(synchronize_sched);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c844ef3c2fae..78d0a87ff354 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,12 +644,12 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
 	 * It is illegal to enter an extended quiescent state while
 	 * in an RCU read-side critical section.
 	 */
-	rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
-			   "Illegal idle entry in RCU read-side critical section.");
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
-			   "Illegal idle entry in RCU-bh read-side critical section.");
-	rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
-			   "Illegal idle entry in RCU-sched read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+			 "Illegal idle entry in RCU read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),
+			 "Illegal idle entry in RCU-bh read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),
+			 "Illegal idle entry in RCU-sched read-side critical section.");
 }
 
 /*
@@ -3162,10 +3162,10 @@ static inline int rcu_blocking_is_gp(void)
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU-sched read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_sched() in RCU-sched read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	if (rcu_gp_is_expedited())
@@ -3189,10 +3189,10 @@ EXPORT_SYMBOL_GPL(synchronize_sched);
  */
 void synchronize_rcu_bh(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 85addcf7d184..07bd9fb393b4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -522,10 +522,10 @@ EXPORT_SYMBOL_GPL(call_rcu);
  */
 void synchronize_rcu(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu() in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_rcu() in RCU read-side critical section");
 	if (!rcu_scheduler_active)
 		return;
 	if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a0a0dd03c73a..47268fb1d27b 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
 void synchronize_rcu_tasks(void)
 {
 	/* Complain if the scheduler has not started.  */
-	rcu_lockdep_assert(!rcu_scheduler_active,
-			   "synchronize_rcu_tasks called too soon");
+	RCU_LOCKDEP_WARN(rcu_scheduler_active,
+			 "synchronize_rcu_tasks called too soon");
 
 	/* Wait for the grace period. */
 	wait_rcu_gp(call_rcu_tasks);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f7510bce..7956c016e750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1987,8 +1987,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
 #ifdef CONFIG_SMP
 inline struct dl_bw *dl_bw_of(int i)
 {
-	rcu_lockdep_assert(rcu_read_lock_sched_held(),
-			   "sched RCU must be held");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
@@ -1997,8 +1997,8 @@ static inline int dl_bw_cpus(int i)
 	struct root_domain *rd = cpu_rq(i)->rd;
 	int cpus = 0;
 
-	rcu_lockdep_assert(rcu_read_lock_sched_held(),
-			   "sched RCU must be held");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
 	for_each_cpu_and(i, rd->span, cpu_active_mask)
 		cpus++;
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 586ad91300b0..80bc8dc8d286 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -338,14 +338,14 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
 #include <trace/events/workqueue.h>
 
 #define assert_rcu_or_pool_mutex()					\
-	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
-			   lockdep_is_held(&wq_pool_mutex),		\
-			   "sched RCU or wq_pool_mutex should be held")
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() &&			\
+			 !lockdep_is_held(&wq_pool_mutex),		\
+			 "sched RCU or wq_pool_mutex should be held")
 
 #define assert_rcu_or_wq_mutex(wq)					\
-	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
-			   lockdep_is_held(&wq->mutex),			\
-			   "sched RCU or wq->mutex should be held")
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() &&			\
+			 !lockdep_is_held(&wq->mutex),			\
+			 "sched RCU or wq->mutex should be held")
 
 #define for_each_cpu_worker_pool(pool, cpu)				\
 	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d26393b..73455089feef 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -400,9 +400,9 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
 {
 	bool match = false;
 
-	rcu_lockdep_assert(rcu_read_lock_held() ||
-			   lockdep_is_held(&devcgroup_mutex),
-			   "device_cgroup:verify_new_ex called without proper synchronization");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+			 !lockdep_is_held(&devcgroup_mutex),
+			 "device_cgroup:verify_new_ex called without proper synchronization");
 
 	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
 		if (behavior == DEVCG_DEFAULT_ALLOW) {


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 22:54         ` Paul E. McKenney
@ 2015-06-19  2:19           ` Paul E. McKenney
  2015-06-30 11:04           ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2015-06-19  2:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Thu, Jun 18, 2015 at 03:54:20PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 18, 2015 at 11:59:55AM +0200, Ingo Molnar wrote:
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

[ . . . ]

> > > The thought is to rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
> > > by analogy to WARN()?  Easy to do if so!  Or am I missing the point?
> > 
> > Yeah, and inverting the condition. Assuming the condition was assert()-style 
> > inverted to begin with! :-)
> 
> It appears to have been.  ;-)
> 
> Please see below for an untested patch.  I made this one big patch,
> but could have one patch add RCU_LOCKDEP_WARN(), a series convert uses
> from rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch
> remove rcu_lockdep_assert().  Let me know!
> 
> > and lockdep should be fixed too, I suspect: lockdep_assert_held() was really a 
> > poorly chosen name; it should be 'lockdep_check_held()' or so? It has very 
> > little to do with the assert() interface.

And this one actually builds and passes light rcutorture testing.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

commit eeacf89826376570bff42edec59fd1606ce1829f
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu Jun 18 15:50:02 2015 -0700

    rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
    
    This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
    consistency with the WARN() series of macros.  This also requires
    inverting the sense of the conditional, which this commit also does.
    
    Reported-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5746b0c77f3e..adc2184009c5 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -883,7 +883,7 @@ All:  lockdep-checked RCU-protected pointer access
 
 	rcu_access_pointer
 	rcu_dereference_raw
-	rcu_lockdep_assert
+	RCU_LOCKDEP_WARN
 	rcu_sleep_check
 	RCU_NONIDLE
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d4298feaa6fb..127ecd687c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -54,9 +54,9 @@ static DEFINE_MUTEX(mce_chrdev_read_mutex);
 
 #define rcu_dereference_check_mce(p) \
 ({ \
-	rcu_lockdep_assert(rcu_read_lock_sched_held() || \
-			   lockdep_is_held(&mce_chrdev_read_mutex), \
-			   "suspicious rcu_dereference_check_mce() usage"); \
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() && \
+			 !lockdep_is_held(&mce_chrdev_read_mutex), \
+			 "suspicious rcu_dereference_check_mce() usage"); \
 	smp_load_acquire(&(p)); \
 })
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab5247687..37f827cb78ed 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -136,7 +136,7 @@ enum ctx_state ist_enter(struct pt_regs *regs)
 	preempt_count_add(HARDIRQ_OFFSET);
 
 	/* This code is a bit fragile.  Test it. */
-	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
 
 	return prev_state;
 }
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 677fb2843553..3b188f20b43f 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -110,8 +110,8 @@ static DEFINE_MUTEX(dev_opp_list_lock);
 
 #define opp_rcu_lockdep_assert()					\
 do {									\
-	rcu_lockdep_assert(rcu_read_lock_held() ||			\
-				lockdep_is_held(&dev_opp_list_lock),	\
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
+				!lockdep_is_held(&dev_opp_list_lock),	\
 			   "Missing rcu_read_lock() or "		\
 			   "dev_opp_list_lock protection");		\
 } while (0)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 230f87bdf5ad..910c75f6cd0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,8 +83,8 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i
 
 static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
 {
-	rcu_lockdep_assert(rcu_read_lock_held() ||
-			   lockdep_is_held(&files->file_lock),
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+			   !lockdep_is_held(&files->file_lock),
 			   "suspicious rcu_dereference_check() usage");
 	return __fcheck_files(files, fd);
 }
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 2f4a93a8ceab..e9e04c32cb4c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -537,14 +537,14 @@ static inline int rcu_read_lock_sched_held(void)
 #ifdef CONFIG_PROVE_RCU
 
 /**
- * rcu_lockdep_assert - emit lockdep splat if specified condition not met
+ * RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
  * @c: condition to check
  * @s: informative message
  */
-#define rcu_lockdep_assert(c, s)					\
+#define RCU_LOCKDEP_WARN(c, s)						\
 	do {								\
 		static bool __section(.data.unlikely) __warned;		\
-		if (debug_lockdep_rcu_enabled() && !__warned && !(c)) {	\
+		if (debug_lockdep_rcu_enabled() && !__warned && (c)) {	\
 			__warned = true;				\
 			lockdep_rcu_suspicious(__FILE__, __LINE__, s);	\
 		}							\
@@ -553,8 +553,8 @@ static inline int rcu_read_lock_sched_held(void)
 #if defined(CONFIG_PROVE_RCU) && !defined(CONFIG_PREEMPT_RCU)
 static inline void rcu_preempt_sleep_check(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
-			   "Illegal context switch in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+			 "Illegal context switch in RCU read-side critical section");
 }
 #else /* #ifdef CONFIG_PROVE_RCU */
 static inline void rcu_preempt_sleep_check(void)
@@ -565,15 +565,15 @@ static inline void rcu_preempt_sleep_check(void)
 #define rcu_sleep_check()						\
 	do {								\
 		rcu_preempt_sleep_check();				\
-		rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),	\
-				   "Illegal context switch in RCU-bh read-side critical section"); \
-		rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),	\
-				   "Illegal context switch in RCU-sched read-side critical section"); \
+		RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),	\
+				 "Illegal context switch in RCU-bh read-side critical section"); \
+		RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),	\
+				 "Illegal context switch in RCU-sched read-side critical section"); \
 	} while (0)
 
 #else /* #ifdef CONFIG_PROVE_RCU */
 
-#define rcu_lockdep_assert(c, s) do { } while (0)
+#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
 #define rcu_sleep_check() do { } while (0)
 
 #endif /* #else #ifdef CONFIG_PROVE_RCU */
@@ -604,13 +604,13 @@ static inline void rcu_preempt_sleep_check(void)
 ({ \
 	/* Dependency order vs. p above. */ \
 	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
-	rcu_lockdep_assert(c, "suspicious rcu_dereference_check() usage"); \
+	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(________p1)); \
 })
 #define __rcu_dereference_protected(p, c, space) \
 ({ \
-	rcu_lockdep_assert(c, "suspicious rcu_dereference_protected() usage"); \
+	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_protected() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(p)); \
 })
@@ -849,8 +849,8 @@ static inline void rcu_read_lock(void)
 	__rcu_read_lock();
 	__acquire(RCU);
 	rcu_lock_acquire(&rcu_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock() used illegally while idle");
 }
 
 /*
@@ -900,8 +900,8 @@ static inline void rcu_read_lock(void)
  */
 static inline void rcu_read_unlock(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock() used illegally while idle");
 	__release(RCU);
 	__rcu_read_unlock();
 	rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
@@ -929,8 +929,8 @@ static inline void rcu_read_lock_bh(void)
 	local_bh_disable();
 	__acquire(RCU_BH);
 	rcu_lock_acquire(&rcu_bh_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock_bh() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock_bh() used illegally while idle");
 }
 
 /*
@@ -940,8 +940,8 @@ static inline void rcu_read_lock_bh(void)
  */
 static inline void rcu_read_unlock_bh(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock_bh() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock_bh() used illegally while idle");
 	rcu_lock_release(&rcu_bh_lock_map);
 	__release(RCU_BH);
 	local_bh_enable();
@@ -965,8 +965,8 @@ static inline void rcu_read_lock_sched(void)
 	preempt_disable();
 	__acquire(RCU_SCHED);
 	rcu_lock_acquire(&rcu_sched_lock_map);
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_lock_sched() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_lock_sched() used illegally while idle");
 }
 
 /* Used by lockdep and tracing: cannot be traced, cannot call lockdep. */
@@ -983,8 +983,8 @@ static inline notrace void rcu_read_lock_sched_notrace(void)
  */
 static inline void rcu_read_unlock_sched(void)
 {
-	rcu_lockdep_assert(rcu_is_watching(),
-			   "rcu_read_unlock_sched() used illegally while idle");
+	RCU_LOCKDEP_WARN(!rcu_is_watching(),
+			 "rcu_read_unlock_sched() used illegally while idle");
 	rcu_lock_release(&rcu_sched_lock_map);
 	__release(RCU_SCHED);
 	preempt_enable();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..ad2d0162532a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -104,8 +104,8 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
 static DEFINE_SPINLOCK(release_agent_path_lock);
 
 #define cgroup_assert_mutex_or_rcu_locked()				\
-	rcu_lockdep_assert(rcu_read_lock_held() ||			\
-			   lockdep_is_held(&cgroup_mutex),		\
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
+			   !lockdep_is_held(&cgroup_mutex),		\
 			   "cgroup_mutex or RCU read lock required");
 
 /*
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fd07d5b7baf..ca368793808e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -451,9 +451,8 @@ EXPORT_SYMBOL(pid_task);
  */
 struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 {
-	rcu_lockdep_assert(rcu_read_lock_held(),
-			   "find_task_by_pid_ns() needs rcu_read_lock()"
-			   " protection");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+			 "find_task_by_pid_ns() needs rcu_read_lock() protection");
 	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
 }
 
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index de35087c92a5..d3fcb2ec8536 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -415,11 +415,11 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
 	struct rcu_head *head = &rcu.head;
 	bool done = false;
 
-	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
-			   !lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&sp->dep_map) ||
+			 lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");
 
 	might_sleep();
 	init_completion(&rcu.completion);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 591af0cb7b9f..e888602d5c56 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -191,10 +191,10 @@ static void rcu_process_callbacks(struct softirq_action *unused)
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_sched() in RCU read-side critical section");
 	cond_resched();
 }
 EXPORT_SYMBOL_GPL(synchronize_sched);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c844ef3c2fae..78d0a87ff354 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,12 +644,12 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
 	 * It is illegal to enter an extended quiescent state while
 	 * in an RCU read-side critical section.
 	 */
-	rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
-			   "Illegal idle entry in RCU read-side critical section.");
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
-			   "Illegal idle entry in RCU-bh read-side critical section.");
-	rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
-			   "Illegal idle entry in RCU-sched read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
+			 "Illegal idle entry in RCU read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map),
+			 "Illegal idle entry in RCU-bh read-side critical section.");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map),
+			 "Illegal idle entry in RCU-sched read-side critical section.");
 }
 
 /*
@@ -3162,10 +3162,10 @@ static inline int rcu_blocking_is_gp(void)
  */
 void synchronize_sched(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_sched() in RCU-sched read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_sched() in RCU-sched read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	if (rcu_gp_is_expedited())
@@ -3189,10 +3189,10 @@ EXPORT_SYMBOL_GPL(synchronize_sched);
  */
 void synchronize_rcu_bh(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
 	if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 85addcf7d184..07bd9fb393b4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -522,10 +522,10 @@ EXPORT_SYMBOL_GPL(call_rcu);
  */
 void synchronize_rcu(void)
 {
-	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
-			   !lock_is_held(&rcu_lock_map) &&
-			   !lock_is_held(&rcu_sched_lock_map),
-			   "Illegal synchronize_rcu() in RCU read-side critical section");
+	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+			 lock_is_held(&rcu_lock_map) ||
+			 lock_is_held(&rcu_sched_lock_map),
+			 "Illegal synchronize_rcu() in RCU read-side critical section");
 	if (!rcu_scheduler_active)
 		return;
 	if (rcu_gp_is_expedited())
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a0a0dd03c73a..47268fb1d27b 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
 void synchronize_rcu_tasks(void)
 {
 	/* Complain if the scheduler has not started.  */
-	rcu_lockdep_assert(!rcu_scheduler_active,
-			   "synchronize_rcu_tasks called too soon");
+	RCU_LOCKDEP_WARN(rcu_scheduler_active,
+			 "synchronize_rcu_tasks called too soon");
 
 	/* Wait for the grace period. */
 	wait_rcu_gp(call_rcu_tasks);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f7510bce..7956c016e750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1987,8 +1987,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
 #ifdef CONFIG_SMP
 inline struct dl_bw *dl_bw_of(int i)
 {
-	rcu_lockdep_assert(rcu_read_lock_sched_held(),
-			   "sched RCU must be held");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
@@ -1997,8 +1997,8 @@ static inline int dl_bw_cpus(int i)
 	struct root_domain *rd = cpu_rq(i)->rd;
 	int cpus = 0;
 
-	rcu_lockdep_assert(rcu_read_lock_sched_held(),
-			   "sched RCU must be held");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
 	for_each_cpu_and(i, rd->span, cpu_active_mask)
 		cpus++;
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 586ad91300b0..80bc8dc8d286 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -338,14 +338,14 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq);
 #include <trace/events/workqueue.h>
 
 #define assert_rcu_or_pool_mutex()					\
-	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
-			   lockdep_is_held(&wq_pool_mutex),		\
-			   "sched RCU or wq_pool_mutex should be held")
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() &&			\
+			 !lockdep_is_held(&wq_pool_mutex),		\
+			 "sched RCU or wq_pool_mutex should be held")
 
 #define assert_rcu_or_wq_mutex(wq)					\
-	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
-			   lockdep_is_held(&wq->mutex),			\
-			   "sched RCU or wq->mutex should be held")
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held() &&			\
+			 !lockdep_is_held(&wq->mutex),			\
+			 "sched RCU or wq->mutex should be held")
 
 #define for_each_cpu_worker_pool(pool, cpu)				\
 	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d26393b..73455089feef 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -400,9 +400,9 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
 {
 	bool match = false;
 
-	rcu_lockdep_assert(rcu_read_lock_held() ||
-			   lockdep_is_held(&devcgroup_mutex),
-			   "device_cgroup:verify_new_ex called without proper synchronization");
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
+			 !lockdep_is_held(&devcgroup_mutex),
+			 "device_cgroup:verify_new_ex called without proper synchronization");
 
 	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
 		if (behavior == DEVCG_DEFAULT_ALLOW) {



* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-18 22:54         ` Paul E. McKenney
  2015-06-19  2:19           ` Paul E. McKenney
@ 2015-06-30 11:04           ` Ingo Molnar
  2015-06-30 16:16             ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-06-30 11:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > Yeah, and inverting the condition. Assuming the condition was assert()-style 
> > inverted to begin with! :-)
> 
> It appears to have been.  ;-)
> 
> Please see below for an untested patch.  I made this be one big patch, but could 
> have one patch add RCU_LOCKDEP_WARN(), a series convert uses from 
> rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch remove 
> rcu_lockdep_assert().  Let me know!

One big patch is perfect I think - it's a rename and a relatively mechanical 
inversion of conditions, no point in splitting it up any more IMHO. (But it's your 
call really.)

So I had a quick look at this patch, and IMHO the RCU_LOCKDEP_WARN() lines read a 
lot more 'naturally', because the new, inverted conditions now highlight buggy 
scenarios - which gives them the same logic parity as the kernel's historic 
WARN_ON()/BUG_ON() patterns:

Reviewed-by: Ingo Molnar <mingo@kernel.org>

This one looked a bit weird:

> index a0a0dd03c73a..47268fb1d27b 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
>  void synchronize_rcu_tasks(void)
>  {
>  	/* Complain if the scheduler has not started.  */
> -	rcu_lockdep_assert(!rcu_scheduler_active,
> -			   "synchronize_rcu_tasks called too soon");
> +	RCU_LOCKDEP_WARN(rcu_scheduler_active,
> +			 "synchronize_rcu_tasks called too soon");
>  

So I'd assume that a flag called 'rcu_scheduler_active' would be 1 if the RCU 
scheduler is active.

So why do we warn on it being active? Shouldn't the condition be:

	RCU_LOCKDEP_WARN(!rcu_scheduler_active,
			 "synchronize_rcu_tasks called too soon");

I.e. we warn when the RCU scheduler is not yet active and we called 
synchronize_rcu_tasks() too soon?

So either the original assert() was wrong, or I'm missing something obvious?

Thanks,

	Ingo


* Re: [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state
  2015-06-30 11:04           ` Ingo Molnar
@ 2015-06-30 16:16             ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2015-06-30 16:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst

On Tue, Jun 30, 2015 at 01:04:14PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > Yeah, and inverting the condition. Assuming the condition was assert()-style 
> > > inverted to begin with! :-)
> > 
> > It appears to have been.  ;-)
> > 
> > Please see below for an untested patch.  I made this be one big patch, but could 
> > have one patch add RCU_LOCKDEP_WARN(), a series convert uses from 
> > rcu_lockdep_assert() to RCU_LOCKDEP_WARN(), and a final patch remove 
> > rcu_lockdep_assert().  Let me know!
> 
> One big patch is perfect I think - it's a rename and a relatively mechanical 
> inversion of conditions, no point in splitting it up any more IMHO. (But it's your 
> call really.)
> 
> So I had a quick look at this patch, and IMHO the RCU_LOCKDEP_WARN() lines read a 
> lot more 'naturally', because the new, inverted conditions now highlight buggy 
> scenarios - which gives them the same logic parity as the kernel's historic 
> WARN_ON()/BUG_ON() patterns:
> 
> Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thank you, added!

> This one looked a bit weird:
> 
> > index a0a0dd03c73a..47268fb1d27b 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -589,8 +589,8 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks);
> >  void synchronize_rcu_tasks(void)
> >  {
> >  	/* Complain if the scheduler has not started.  */
> > -	rcu_lockdep_assert(!rcu_scheduler_active,
> > -			   "synchronize_rcu_tasks called too soon");
> > +	RCU_LOCKDEP_WARN(rcu_scheduler_active,
> > +			 "synchronize_rcu_tasks called too soon");
> >  
> 
> So I'd assume that a flag called 'rcu_scheduler_active' would be 1 if the RCU 
> scheduler is active.
> 
> So why do we warn on it being active? Shouldn't the condition be:
> 
> 	RCU_LOCKDEP_WARN(!rcu_scheduler_active,
> 			 "synchronize_rcu_tasks called too soon");
> 
> I.e. we warn when the RCU scheduler is not yet active and we called 
> synchronize_rcu_tasks() too soon?
> 
> So either the original assert() was wrong, or I'm missing something obvious?

You are missing nothing!  But I really do test this stuff...

Ah, I see...  I need the following patch in order to enable lockdep-RCU
on one of my RCU-tasks rcutorture scenarios...  :-/

Good catch!  There were at least three bugs hiding behind that one!  ;-)

							Thanx, Paul

------------------------------------------------------------------------

commit dc883f1668c83f9525a13ee1b3cd45e9e85a0fe5
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Jun 30 09:14:01 2015 -0700

    rcutorture: Enable lockdep-RCU on TASKS01
    
    Currently none of the RCU-tasks scenarios enables lockdep-RCU, which
    causes bugs to be missed.  This commit therefore enables lockdep-RCU
    on TASKS01.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
index 2cc0e60eba6e..bafe94cbd739 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS01
@@ -5,6 +5,6 @@ CONFIG_PREEMPT_NONE=n
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=y
 CONFIG_DEBUG_LOCK_ALLOC=y
-CONFIG_PROVE_LOCKING=n
-#CHECK#CONFIG_PROVE_RCU=n
+CONFIG_PROVE_LOCKING=y
+#CHECK#CONFIG_PROVE_RCU=y
 CONFIG_RCU_EXPERT=y
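
The rcupdate.h hunk in the big rename patch above also shows why an inverted
condition can hide: with CONFIG_PROVE_RCU unset, RCU_LOCKDEP_WARN() compiles
to an empty statement, so a backwards check is never even evaluated.  Below
is a minimal standalone C model of that config gate; the CONFIG_PROVE_RCU
define and the fprintf() reporting are simplifications for illustration, not
the kernel's actual machinery.

	#include <stdio.h>

	/* Stand-in for CONFIG_PROVE_RCU=y; comment out to model PROVE_RCU=n. */
	#define CONFIG_PROVE_RCU 1

	#ifdef CONFIG_PROVE_RCU
	#define RCU_LOCKDEP_WARN(c, s) \
		do { if ((c)) fprintf(stderr, "splat: %s\n", s); } while (0)
	#else
	/* With PROVE_RCU=n the check vanishes, backwards or not. */
	#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
	#endif

	int main(void)
	{
		int rcu_scheduler_active = 1;	/* the normal, post-boot case */

		/* The inverted check: it splats in exactly the healthy case. */
		RCU_LOCKDEP_WARN(rcu_scheduler_active,
				 "synchronize_rcu_tasks called too soon");
		return 0;
	}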



end of thread

Thread overview: 45+ messages
2015-06-16 20:16 [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 01/13] context_tracking: Add context_tracking_assert_state Andy Lutomirski
2015-06-17  9:41   ` Ingo Molnar
2015-06-17 14:15     ` Andy Lutomirski
2015-06-18  9:57       ` Ingo Molnar
2015-06-18 11:07         ` Andy Lutomirski
2015-06-18 15:52           ` Andy Lutomirski
2015-06-18 16:17             ` Ingo Molnar
2015-06-18 16:26               ` Frederic Weisbecker
2015-06-18 19:26                 ` Andy Lutomirski
2015-06-17 15:27     ` Paul E. McKenney
2015-06-18  9:59       ` Ingo Molnar
2015-06-18 22:54         ` Paul E. McKenney
2015-06-19  2:19           ` Paul E. McKenney
2015-06-30 11:04           ` Ingo Molnar
2015-06-30 16:16             ` Paul E. McKenney
2015-06-16 20:16 ` [RFC/INCOMPLETE 02/13] notifiers: Assert that RCU is watching in notify_die Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 03/13] x86: Move C entry and exit code to arch/x86/entry/common.c Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 04/13] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 05/13] x86/entry: Add enter_from_user_mode and use it in syscalls Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 06/13] x86/entry: Add new, comprehensible entry and exit hooks Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 07/13] x86/entry/64: Really create an error-entry-from-usermode code path Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 08/13] x86/entry/64: Migrate 64-bit syscalls to new exit hooks Andy Lutomirski
2015-06-17 10:00   ` Ingo Molnar
2015-06-17 10:02     ` Ingo Molnar
2015-06-17 14:12       ` Andy Lutomirski
2015-06-18 10:17         ` Ingo Molnar
2015-06-18 10:19           ` Ingo Molnar
2015-06-16 20:16 ` [RFC/INCOMPLETE 09/13] x86/entry/compat: Migrate compat " Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 10/13] x86/asm/entry/64: Save all regs on interrupt entry Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 11/13] x86/asm/entry/64: Simplify irq stack pt_regs handling Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 12/13] x86/asm/entry/64: Migrate error and interrupt exit work to C Andy Lutomirski
2015-06-16 20:16 ` [RFC/INCOMPLETE 13/13] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h Andy Lutomirski
2015-06-17  9:48 ` [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Ingo Molnar
2015-06-17 10:13   ` Richard Weinberger
2015-06-17 11:04     ` Ingo Molnar
2015-06-17 14:19     ` Andy Lutomirski
2015-06-17 15:16   ` Andy Lutomirski
2015-06-18 10:14     ` Ingo Molnar
2015-06-17 10:32 ` Ingo Molnar
2015-06-17 11:14   ` Ingo Molnar
2015-06-17 14:23   ` Andy Lutomirski
2015-06-18 10:11     ` Ingo Molnar
2015-06-18 11:06       ` Andy Lutomirski
2015-06-18 16:24         ` Ingo Molnar
