* [PATCH v4 00/17] x86: Rewrite exit-to-userspace code
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

This is the first big batch of x86 asm-to-C conversion patches.

The exit-to-usermode code is copied in several places and is written
in a nasty combination of asm and C.  It's not at all clear what
it's supposed to do, and the way it's structured makes it very hard
to work with.  For example, it's not even clear why syscall exit
hooks are called only once per syscall right now.  (It seems to be a
side effect of the way that rdi and rdx are handled in the asm loop,
and it seems reliable, but it's still pointlessly complicated.)  The
existing code also makes context tracking overly complicated and
hard to understand.  Finally, it's nearly impossible for anyone to
change what happens on exit to usermode, since the existing code is
so fragile.

I tried to clean it up incrementally, but I decided it was too hard.
Instead, this series just replaces the code.  It seems to work.

Context tracking in particular works very differently now.  The
low-level entry code checks that we're in CONTEXT_USER and switches
to CONTEXT_KERNEL.  The exit code does the reverse.  There is no
need to track what CONTEXT_XYZ state we came from, because we
already know.  Similarly, SCHEDULE_USER is gone, since we can
reschedule if needed by simply calling schedule() from C code.
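
As a minimal sketch of the intended pairing (using the ct_state and
CT_WARN_ON helpers added later in this series; the real entry/exit
helpers are enter_from_user_mode and prepare_exit_to_usermode):

	/* on entry from user mode, with IRQs off: */
	CT_WARN_ON(ct_state() != CONTEXT_USER);
	user_exit();			/* switch to CONTEXT_KERNEL */

	/* ... syscall or exception work runs here ... */

	/* on the way back out, with IRQs off: */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
	user_enter();			/* switch back to CONTEXT_USER */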

The main things that are missing are that I haven't done the 32-bit
parts (anyone want to help?) and therefore I haven't deleted the old
C code.  I also think this may break UML for trivial reasons.

IRQ context tracking is still messy.  Once the cleanup progresses
to the point that we can enter CONTEXT_KERNEL in syscalls before
enabling interrupts, we can fully clean up IRQ context tracking.

Once these land, I'll send some more :)

Note: we might want to backport patches 1 and 2.

Changes from v3:
 - Add the syscall_arg_fault_32 test.
 - Fix a pre-existing bad syscall arg buglet.
 - Fix an asm glitch due to a bad rebase.
 - Fix a CONFIG_PROVE_LOCKDEP warning.
Borislav: the end result of this series differs from the v3.91 that I
sent you only in the removal of a single trailing tab.  The badarg
patch is in a different place now, though, since we might want to
backport it.

Changes from v2: Misplaced the actual list -- sorry.

Changes from v1:
 - Fix bisection failure by squashing the 64-bit native and compat syscall
   conversions together.  The intermediate state didn't build, and fixing
   it isn't worthwhile (the results will be harder to understand).
 - Replace context_tracking_assert_state with CT_WARN_ON and ct_state.
 - The last two patches are new.  I incorrectly thought that we weren't
   ready for them yet on 32-bit kernels, but I was wrong.

Andy Lutomirski (16):
  selftests/x86: Add a test for 32-bit fast syscall arg faults
  x86/entry/64/compat: Fix bad fast syscall arg failure path
  context_tracking: Add ct_state and CT_WARN_ON
  notifiers: Assert that RCU is watching in notify_die
  x86: Move C entry and exit code to arch/x86/entry/common.c
  x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
  x86/entry: Add enter_from_user_mode and use it in syscalls
  x86/entry: Add new, comprehensible entry and exit hooks
  x86/entry/64: Really create an error-entry-from-usermode code path
  x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
  x86/asm/entry/64: Save all regs on interrupt entry
  x86/asm/entry/64: Simplify irq stack pt_regs handling
  x86/asm/entry/64: Migrate error and interrupt exit work to C
  x86/entry: Remove exception_enter from most trap handlers
  x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h
  x86/irq: Document how IRQ context tracking works and add an assertion

Ingo Molnar (1):
  uml: Fix do_signal() prototype

 arch/um/include/shared/kern_util.h              |   3 +-
 arch/um/kernel/process.c                        |   6 +-
 arch/um/kernel/signal.c                         |   8 +-
 arch/um/kernel/tlb.c                            |   2 +-
 arch/um/kernel/trap.c                           |   2 +-
 arch/x86/entry/Makefile                         |   1 +
 arch/x86/entry/common.c                         | 374 ++++++++++++++++++++++++
 arch/x86/entry/entry_64.S                       | 180 +++---------
 arch/x86/entry/entry_64_compat.S                |  47 ++-
 arch/x86/include/asm/context_tracking.h         |  10 -
 arch/x86/include/asm/signal.h                   |   1 +
 arch/x86/include/asm/traps.h                    |   4 +-
 arch/x86/kernel/cpu/mcheck/mce.c                |   5 +-
 arch/x86/kernel/cpu/mcheck/p5.c                 |   5 +-
 arch/x86/kernel/cpu/mcheck/winchip.c            |   4 +-
 arch/x86/kernel/irq.c                           |  15 +
 arch/x86/kernel/ptrace.c                        | 202 +------------
 arch/x86/kernel/signal.c                        |  28 +-
 arch/x86/kernel/traps.c                         |  87 ++----
 include/linux/context_tracking.h                |  15 +
 include/linux/context_tracking_state.h          |   1 +
 kernel/notifier.c                               |   2 +
 tools/testing/selftests/x86/Makefile            |   2 +-
 tools/testing/selftests/x86/syscall_arg_fault.c | 130 ++++++++
 24 files changed, 672 insertions(+), 462 deletions(-)
 create mode 100644 arch/x86/entry/common.c
 delete mode 100644 arch/x86/include/asm/context_tracking.h
 create mode 100644 tools/testing/selftests/x86/syscall_arg_fault.c

-- 
2.4.3


* [PATCH v4 01/17] selftests/x86: Add a test for 32-bit fast syscall arg faults
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

This test passes on 4.0 and fails on some newer kernels.  Fortunately,
the failure is likely not a big deal.  This test will make sure that
we don't break it further (e.g. by OOPSing) as we clean up the entry
code and that we eventually fix the regression.

There's arguably no need to preserve the old ABI here -- anything
that makes it into a fast (vDSO) syscall with a bad stack is about
to crash no matter what we do.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 tools/testing/selftests/x86/Makefile            |   2 +-
 tools/testing/selftests/x86/syscall_arg_fault.c | 130 ++++++++++++++++++++++++
 2 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/syscall_arg_fault.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..e8df47e6326c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -5,7 +5,7 @@ include ../lib.mk
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
 TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
-TARGETS_C_32BIT_ONLY := entry_from_vm86
+TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault
 
 TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
 BINARIES_32 := $(TARGETS_C_32BIT_ALL:%=%_32)
diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
new file mode 100644
index 000000000000..7db4fc9fa09f
--- /dev/null
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -0,0 +1,130 @@
+/*
+ * syscall_arg_fault.c - tests faults in 32-bit fast syscall stack args
+ * Copyright (c) 2015 Andrew Lutomirski
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/signal.h>
+#include <sys/ucontext.h>
+#include <err.h>
+#include <setjmp.h>
+#include <errno.h>
+
+/* Our sigaltstack scratch space. */
+static unsigned char altstack_data[SIGSTKSZ];
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static volatile sig_atomic_t sig_traps;
+static sigjmp_buf jmpbuf;
+
+static volatile sig_atomic_t n_errs;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+	ucontext_t *ctx = (ucontext_t*)ctx_void;
+
+	if (ctx->uc_mcontext.gregs[REG_EAX] != -EFAULT) {
+		printf("[FAIL]\tAX had the wrong value: 0x%x\n",
+		       ctx->uc_mcontext.gregs[REG_EAX]);
+		n_errs++;
+	} else {
+		printf("[OK]\tSeems okay\n");
+	}
+
+	siglongjmp(jmpbuf, 1);
+}
+
+static void sigill(int sig, siginfo_t *info, void *ctx_void)
+{
+	printf("[SKIP]\tIllegal instruction\n");
+	siglongjmp(jmpbuf, 1);
+}
+
+int main()
+{
+	stack_t stack = {
+		.ss_sp = altstack_data,
+		.ss_size = SIGSTKSZ,
+	};
+	if (sigaltstack(&stack, NULL) != 0)
+		err(1, "sigaltstack");
+
+	sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+	sethandler(SIGILL, sigill, SA_ONSTACK);
+
+	/*
+	 * Exercise another nasty special case.  The 32-bit SYSCALL
+	 * and SYSENTER instructions (even in compat mode) each
+	 * clobber one register.  A Linux system call has a syscall
+	 * number and six arguments, and the user stack pointer
+	 * needs to live in some register on return.  That means
+	 * that we need eight registers, but SYSCALL and SYSENTER
+	 * only preserve seven registers.  As a result, one argument
+	 * ends up on the stack.  The stack is user memory, which
+	 * means that the kernel can fail to read it.
+	 *
+	 * The 32-bit fast system calls don't have a defined ABI:
+	 * we're supposed to invoke them through the vDSO.  So we'll
+	 * fudge it: we set all regs to invalid pointer values and
+	 * invoke the entry instruction.  The return will fail no
+	 * matter what, and we completely lose our program state,
+	 * but we can fix it up with a signal handler.
+	 */
+
+	printf("[RUN]\tSYSENTER with invalid state\n");
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		asm volatile (
+			"movl $-1, %%eax\n\t"
+			"movl $-1, %%ebx\n\t"
+			"movl $-1, %%ecx\n\t"
+			"movl $-1, %%edx\n\t"
+			"movl $-1, %%esi\n\t"
+			"movl $-1, %%edi\n\t"
+			"movl $-1, %%ebp\n\t"
+			"movl $-1, %%esp\n\t"
+			"sysenter"
+			: : : "memory", "flags");
+	}
+
+	printf("[RUN]\tSYSCALL with invalid state\n");
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		asm volatile (
+			"movl $-1, %%eax\n\t"
+			"movl $-1, %%ebx\n\t"
+			"movl $-1, %%ecx\n\t"
+			"movl $-1, %%edx\n\t"
+			"movl $-1, %%esi\n\t"
+			"movl $-1, %%edi\n\t"
+			"movl $-1, %%ebp\n\t"
+			"movl $-1, %%esp\n\t"
+			"syscall\n\t"
+			"pushl $0"	/* make sure we segfault cleanly */
+			: : : "memory", "flags");
+	}
+
+	return 0;
+}
-- 
2.4.3


* [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

If user code does SYSCALL32 or SYSENTER without a valid stack, then
our attempt to determine the syscall args will result in a uaccess
fault.  Previously, we would try to recover by jumping to
the syscall exit code, but we'd run the syscall exit work even
though we never made it to the syscall entry work.

Clean it up by treating the failure path as a non-syscall entry and
exit pair.

This fixes strace's output when running the syscall_arg_fault test.
Without this fix, strace would get out of sync and would fail to
associate syscall entries with syscall exits.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        |  2 +-
 arch/x86/entry/entry_64_compat.S | 35 +++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c4302df1..141a5d49dddc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -613,7 +613,7 @@ ret_from_intr:
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
 	/* Interrupt came from user space */
-retint_user:
+GLOBAL(retint_user)
 	GET_THREAD_INFO(%rcx)
 
 	/* %rcx: thread info. Interrupts are off. */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index bb187a6a877c..efe0b1e499fa 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -425,8 +425,39 @@ cstar_tracesys:
 END(entry_SYSCALL_compat)
 
 ia32_badarg:
-	ASM_CLAC
-	movq	$-EFAULT, RAX(%rsp)
+	/*
+	 * So far, we've entered kernel mode, set AC, turned on IRQs, and
+	 * saved C regs except r8-r11.  We haven't done any of the other
+	 * standard entry work, though.  We want to bail, but we shouldn't
+	 * treat this as a syscall entry since we don't even know what the
+	 * args are.  Instead, treat this as a non-syscall entry, finish
+	 * the entry work, and immediately exit after setting AX = -EFAULT.
+	 *
+	 * We're really just being polite here.  Killing the task outright
+	 * would be a reasonable action, too.  Given that the only valid
+	 * way to have gotten here is through the vDSO, and we already know
+	 * that the stack pointer is bad, the task isn't going to survive
+	 * for long no matter what we do.
+	 */
+
+	ASM_CLAC			/* undo STAC */
+	movq	$-EFAULT, RAX(%rsp)	/* return -EFAULT if possible */
+
+	/* Fill in the rest of pt_regs */
+	xorl	%eax, %eax
+	movq	%rax, R11(%rsp)
+	movq	%rax, R10(%rsp)
+	movq	%rax, R9(%rsp)
+	movq	%rax, R8(%rsp)
+	SAVE_EXTRA_REGS
+
+	/* Turn IRQs back off. */
+	DISABLE_INTERRUPTS(CLBR_NONE)
+	TRACE_IRQS_OFF
+
+	/* And exit again. */
+	jmp retint_user
+
 ia32_ret_from_sys_call:
 	xorl	%eax, %eax		/* Do not leak kernel information */
 	movq	%rax, R11(%rsp)
-- 
2.4.3


* [PATCH v4 03/17] uml: Fix do_signal() prototype
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Ingo Molnar, Richard Weinberger, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Andy Lutomirski

From: Ingo Molnar <mingo@kernel.org>

Once x86 exports its do_signal(), the prototypes will clash.

Fix the clash and also improve the code a bit: remove the unnecessary
kern_do_signal() indirection. This allows interrupt_end() to share
the 'regs' parameter calculation.

Also remove the unused return code to match x86.

Minimally build and boot tested.

Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Adjusted the commit message because I reordered the patch. --Andy]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/um/include/shared/kern_util.h | 3 ++-
 arch/um/kernel/process.c           | 6 ++++--
 arch/um/kernel/signal.c            | 8 +-------
 arch/um/kernel/tlb.c               | 2 +-
 arch/um/kernel/trap.c              | 2 +-
 5 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 83a91f976330..35ab97e4bb9b 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -22,7 +22,8 @@ extern int kmalloc_ok;
 extern unsigned long alloc_stack(int order, int atomic);
 extern void free_stack(unsigned long stack, int order);
 
-extern int do_signal(void);
+struct pt_regs;
+extern void do_signal(struct pt_regs *regs);
 extern void interrupt_end(void);
 extern void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs);
 
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 68b9119841cd..a6d922672b9f 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -90,12 +90,14 @@ void *__switch_to(struct task_struct *from, struct task_struct *to)
 
 void interrupt_end(void)
 {
+	struct pt_regs *regs = &current->thread.regs;
+
 	if (need_resched())
 		schedule();
 	if (test_thread_flag(TIF_SIGPENDING))
-		do_signal();
+		do_signal(regs);
 	if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
-		tracehook_notify_resume(&current->thread.regs);
+		tracehook_notify_resume(regs);
 }
 
 void exit_thread(void)
diff --git a/arch/um/kernel/signal.c b/arch/um/kernel/signal.c
index 4f60e4aad790..57acbd67d85d 100644
--- a/arch/um/kernel/signal.c
+++ b/arch/um/kernel/signal.c
@@ -64,7 +64,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	signal_setup_done(err, ksig, singlestep);
 }
 
-static int kern_do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 	int handled_sig = 0;
@@ -110,10 +110,4 @@ static int kern_do_signal(struct pt_regs *regs)
 	 */
 	if (!handled_sig)
 		restore_saved_sigmask();
-	return handled_sig;
-}
-
-int do_signal(void)
-{
-	return kern_do_signal(&current->thread.regs);
 }
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index f1b3eb14b855..2077248e8a72 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -291,7 +291,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 		/* We are under mmap_sem, release it such that current can terminate */
 		up_write(&current->mm->mmap_sem);
 		force_sig(SIGKILL, current);
-		do_signal();
+		do_signal(&current->thread.regs);
 	}
 }
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 47ff9b7f3e5d..1b0f5c59d522 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -173,7 +173,7 @@ static void bad_segv(struct faultinfo fi, unsigned long ip)
 void fatal_sigsegv(void)
 {
 	force_sigsegv(SIGSEGV, current);
-	do_signal();
+	do_signal(&current->thread.regs);
 	/*
 	 * This is to tell gcc that we're not returning - do_signal
 	 * can, in general, return, but in this case, it's not, since
-- 
2.4.3


* [PATCH v4 04/17] context_tracking: Add ct_state and CT_WARN_ON
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

This will let us sprinkle sanity checks around the kernel without
making too much of a mess.
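
For example (this is exactly how a later patch in this series wires up
the x86 trap handlers), an exception entry can assert that
exception_enter() left us in kernel context:

	enum ctx_state prev_state = exception_enter();

	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
	/* ... handle the exception ... */
	exception_exit(prev_state);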

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/context_tracking.h       | 15 +++++++++++++++
 include/linux/context_tracking_state.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..008fc67d0d96 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -49,13 +49,28 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 	}
 }
 
+
+/**
+ * ct_state() - return the current context tracking state if known
+ *
+ * Returns the current cpu's context tracking state if context tracking
+ * is enabled.  If context tracking is disabled, returns
+ * CONTEXT_DISABLED.  This should be used primarily for debugging.
+ */
+static inline enum ctx_state ct_state(void)
+{
+	return context_tracking_is_enabled() ?
+		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
+}
 #else
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
+static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
+#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))
 
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
 extern void context_tracking_init(void);
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 678ecdf90cf6..ee956c528fab 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -14,6 +14,7 @@ struct context_tracking {
 	bool active;
 	int recursion;
 	enum ctx_state {
+		CONTEXT_DISABLED = -1,	/* returned by ct_state() if unknown */
 		CONTEXT_KERNEL = 0,
 		CONTEXT_USER,
 		CONTEXT_GUEST,
-- 
2.4.3


* [PATCH v4 05/17] notifiers: Assert that RCU is watching in notify_die
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

Low-level arch entries often call notify_die, and it's easy for arch
code to fail to exit an RCU quiescent state first.  Assert that
we're not quiescent in notify_die.
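
As an illustration of the failure mode this catches (the handler and
the names below are hypothetical):

	/* Hypothetical buggy entry path, for illustration only: */
	dotraplinkage void do_example_trap(struct pt_regs *regs, long error_code)
	{
		/*
		 * Bug: nothing called exception_enter() first.  If we
		 * arrived from user mode on a nohz_full CPU, RCU may not
		 * be watching here, and the new assertion will fire.
		 */
		notify_die(DIE_TRAP, "example", regs, error_code,
			   X86_TRAP_UD, SIGILL);
	}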

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/notifier.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index ae9fc7cc360e..980e4330fb59 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -544,6 +544,8 @@ int notrace notify_die(enum die_val val, const char *str,
 		.signr	= sig,
 
 	};
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "notify_die called but RCU thinks we're quiescent");
 	return atomic_notifier_call_chain(&die_chain, val, &args);
 }
 NOKPROBE_SYMBOL(notify_die);
-- 
2.4.3


* [PATCH v4 06/17] x86: Move C entry and exit code to arch/x86/entry/common.c
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to ptrace or
signal handling.  Move them together in a new file.

This change just moves code around.  It doesn't change any behavior.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/Makefile       |   1 +
 arch/x86/entry/common.c       | 253 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/signal.h |   1 +
 arch/x86/kernel/ptrace.c      | 202 +--------------------------------
 arch/x86/kernel/signal.c      |  28 +----
 5 files changed, 257 insertions(+), 228 deletions(-)
 create mode 100644 arch/x86/entry/common.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7a144971db79..bd55dedd7614 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,6 +2,7 @@
 # Makefile for the x86 low level entry code
 #
 obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y				+= common.o
 
 obj-y				+= vdso/
 obj-y				+= vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..348465473e55
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,253 @@
+/*
+ * entry.c - C code for kernel entry and exit
+ * Copyright (c) 2015 Andrew Lutomirski
+ * GPL v2
+ *
+ * Based on asm and ptrace code by many authors.  The code here originated
+ * in ptrace.c and signal.c.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/errno.h>
+#include <linux/ptrace.h>
+#include <linux/tracehook.h>
+#include <linux/audit.h>
+#include <linux/seccomp.h>
+#include <linux/signal.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+
+#include <asm/desc.h>
+#include <asm/traps.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
+/*
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:			resume the syscall
+ * 1:			go to phase 2; no seccomp phase 2 needed
+ * anything else:	go to phase 2; pass return value to seccomp
+ */
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	/*
+	 * If TIF_NOHZ is set, we are required to call user_exit() before
+	 * doing anything that could touch RCU.
+	 */
+	if (work & _TIF_NOHZ) {
+		user_exit();
+		work &= ~_TIF_NOHZ;
+	}
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
+{
+	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	/*
+	 * If we stepped into a sysenter/syscall insn, it trapped in
+	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
+	 * If user-mode had set TF itself, then it's still clear from
+	 * do_debug() and we need to set it again to restore the user
+	 * state.  If we entered on the slow path, TF was already set.
+	 */
+	if (work & _TIF_SINGLESTEP)
+		regs->flags |= X86_EFLAGS_TF;
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
+		/* seccomp failures shouldn't expose any additional code. */
+		return -1;
+	}
+#endif
+
+	if (unlikely(work & _TIF_SYSCALL_EMU))
+		ret = -1L;
+
+	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
+	    tracehook_report_syscall_entry(regs))
+		ret = -1L;
+
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+		trace_sys_enter(regs, regs->orig_ax);
+
+	do_audit_syscall_entry(regs, arch);
+
+	return ret ?: regs->orig_ax;
+}
+
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
+void syscall_trace_leave(struct pt_regs *regs)
+{
+	bool step;
+
+	/*
+	 * We may come here right after calling schedule_user()
+	 * or do_notify_resume(), in which case we can be in RCU
+	 * user mode.
+	 */
+	user_exit();
+
+	audit_syscall_exit(regs);
+
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+		trace_sys_exit(regs, regs->ax);
+
+	/*
+	 * If TIF_SYSCALL_EMU is set, we only get here because of
+	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+	 * We already reported this syscall instruction in
+	 * syscall_trace_enter().
+	 */
+	step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
+			!test_thread_flag(TIF_SYSCALL_EMU);
+	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
+		tracehook_report_syscall_exit(regs, step);
+
+	user_enter();
+}
+
+/*
+ * notification of userspace execution resumption
+ * - triggered by the TIF_WORK_MASK flags
+ */
+__visible void
+do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+{
+	user_exit();
+
+	if (thread_info_flags & _TIF_UPROBE)
+		uprobe_notify_resume(regs);
+
+	/* deal with pending signal delivery */
+	if (thread_info_flags & _TIF_SIGPENDING)
+		do_signal(regs);
+
+	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+		clear_thread_flag(TIF_NOTIFY_RESUME);
+		tracehook_notify_resume(regs);
+	}
+	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
+		fire_user_return_notifiers();
+
+	user_enter();
+}
diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 31eab867e6d3..b42408bcf6b5 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -30,6 +30,7 @@ typedef sigset_t compat_sigset_t;
 #endif /* __ASSEMBLY__ */
 #include <uapi/asm/signal.h>
 #ifndef __ASSEMBLY__
+extern void do_signal(struct pt_regs *regs);
 extern void do_notify_resume(struct pt_regs *, void *, __u32);
 
 #define __ARCH_HAS_SA_RESTORER
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..4aa1ab6435d3 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -37,12 +37,10 @@
 #include <asm/proto.h>
 #include <asm/hw_breakpoint.h>
 #include <asm/traps.h>
+#include <asm/syscall.h>
 
 #include "tls.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
 enum x86_regset {
 	REGSET_GENERAL,
 	REGSET_FP,
@@ -1434,201 +1432,3 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	/* Send us the fake SIGTRAP */
 	force_sig_info(SIGTRAP, &info, tsk);
 }
-
-static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
-{
-#ifdef CONFIG_X86_64
-	if (arch == AUDIT_ARCH_X86_64) {
-		audit_syscall_entry(regs->orig_ax, regs->di,
-				    regs->si, regs->dx, regs->r10);
-	} else
-#endif
-	{
-		audit_syscall_entry(regs->orig_ax, regs->bx,
-				    regs->cx, regs->dx, regs->si);
-	}
-}
-
-/*
- * We can return 0 to resume the syscall or anything else to go to phase
- * 2.  If we resume the syscall, we need to put something appropriate in
- * regs->orig_ax.
- *
- * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
- * are fully functional.
- *
- * For phase 2's benefit, our return value is:
- * 0:			resume the syscall
- * 1:			go to phase 2; no seccomp phase 2 needed
- * anything else:	go to phase 2; pass return value to seccomp
- */
-unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
-{
-	unsigned long ret = 0;
-	u32 work;
-
-	BUG_ON(regs != task_pt_regs(current));
-
-	work = ACCESS_ONCE(current_thread_info()->flags) &
-		_TIF_WORK_SYSCALL_ENTRY;
-
-	/*
-	 * If TIF_NOHZ is set, we are required to call user_exit() before
-	 * doing anything that could touch RCU.
-	 */
-	if (work & _TIF_NOHZ) {
-		user_exit();
-		work &= ~_TIF_NOHZ;
-	}
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Do seccomp first -- it should minimize exposure of other
-	 * code, and keeping seccomp fast is probably more valuable
-	 * than the rest of this.
-	 */
-	if (work & _TIF_SECCOMP) {
-		struct seccomp_data sd;
-
-		sd.arch = arch;
-		sd.nr = regs->orig_ax;
-		sd.instruction_pointer = regs->ip;
-#ifdef CONFIG_X86_64
-		if (arch == AUDIT_ARCH_X86_64) {
-			sd.args[0] = regs->di;
-			sd.args[1] = regs->si;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->r10;
-			sd.args[4] = regs->r8;
-			sd.args[5] = regs->r9;
-		} else
-#endif
-		{
-			sd.args[0] = regs->bx;
-			sd.args[1] = regs->cx;
-			sd.args[2] = regs->dx;
-			sd.args[3] = regs->si;
-			sd.args[4] = regs->di;
-			sd.args[5] = regs->bp;
-		}
-
-		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
-		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
-
-		ret = seccomp_phase1(&sd);
-		if (ret == SECCOMP_PHASE1_SKIP) {
-			regs->orig_ax = -1;
-			ret = 0;
-		} else if (ret != SECCOMP_PHASE1_OK) {
-			return ret;  /* Go directly to phase 2 */
-		}
-
-		work &= ~_TIF_SECCOMP;
-	}
-#endif
-
-	/* Do our best to finish without phase 2. */
-	if (work == 0)
-		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
-
-#ifdef CONFIG_AUDITSYSCALL
-	if (work == _TIF_SYSCALL_AUDIT) {
-		/*
-		 * If there is no more work to be done except auditing,
-		 * then audit in phase 1.  Phase 2 always audits, so, if
-		 * we audit here, then we can't go on to phase 2.
-		 */
-		do_audit_syscall_entry(regs, arch);
-		return 0;
-	}
-#endif
-
-	return 1;  /* Something is enabled that we can't handle in phase 1 */
-}
-
-/* Returns the syscall nr to run (which should match regs->orig_ax). */
-long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
-				unsigned long phase1_result)
-{
-	long ret = 0;
-	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
-		_TIF_WORK_SYSCALL_ENTRY;
-
-	BUG_ON(regs != task_pt_regs(current));
-
-	/*
-	 * If we stepped into a sysenter/syscall insn, it trapped in
-	 * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
-	 * If user-mode had set TF itself, then it's still clear from
-	 * do_debug() and we need to set it again to restore the user
-	 * state.  If we entered on the slow path, TF was already set.
-	 */
-	if (work & _TIF_SINGLESTEP)
-		regs->flags |= X86_EFLAGS_TF;
-
-#ifdef CONFIG_SECCOMP
-	/*
-	 * Call seccomp_phase2 before running the other hooks so that
-	 * they can see any changes made by a seccomp tracer.
-	 */
-	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
-		/* seccomp failures shouldn't expose any additional code. */
-		return -1;
-	}
-#endif
-
-	if (unlikely(work & _TIF_SYSCALL_EMU))
-		ret = -1L;
-
-	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
-	    tracehook_report_syscall_entry(regs))
-		ret = -1L;
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_enter(regs, regs->orig_ax);
-
-	do_audit_syscall_entry(regs, arch);
-
-	return ret ?: regs->orig_ax;
-}
-
-long syscall_trace_enter(struct pt_regs *regs)
-{
-	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
-	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
-
-	if (phase1_result == 0)
-		return regs->orig_ax;
-	else
-		return syscall_trace_enter_phase2(regs, arch, phase1_result);
-}
-
-void syscall_trace_leave(struct pt_regs *regs)
-{
-	bool step;
-
-	/*
-	 * We may come here right after calling schedule_user()
-	 * or do_notify_resume(), in which case we can be in RCU
-	 * user mode.
-	 */
-	user_exit();
-
-	audit_syscall_exit(regs);
-
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
-		trace_sys_exit(regs, regs->ax);
-
-	/*
-	 * If TIF_SYSCALL_EMU is set, we only get here because of
-	 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
-	 * We already reported this syscall instruction in
-	 * syscall_trace_enter().
-	 */
-	step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
-			!test_thread_flag(TIF_SYSCALL_EMU);
-	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
-		tracehook_report_syscall_exit(regs, step);
-
-	user_enter();
-}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 206996c1669d..197c44e8ff8b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -701,7 +701,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
  * mistake.
  */
-static void do_signal(struct pt_regs *regs)
+void do_signal(struct pt_regs *regs)
 {
 	struct ksignal ksig;
 
@@ -736,32 +736,6 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
-	user_exit();
-
-	if (thread_info_flags & _TIF_UPROBE)
-		uprobe_notify_resume(regs);
-
-	/* deal with pending signal delivery */
-	if (thread_info_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
-
-	user_enter();
-}
-
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
 {
 	struct task_struct *me = current;
-- 
2.4.3


* [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

Other than the super-atomic exception entries, all exception entries
are supposed to switch our context tracking state to CONTEXT_KERNEL.
Assert that they do.  These assertions appear trivial at this point,
as exception_enter is the function responsible for switching
context, but I'm planning on reworking x86's exception context
tracking, and these assertions will help make sure that all of this
code keeps working.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f5791927aa64..2a783c4fe0e9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,6 +292,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
 	enum ctx_state prev_state = exception_enter();
 	siginfo_t info;
 
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
 	if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
 			NOTIFY_STOP) {
 		conditional_sti(regs);
@@ -376,6 +378,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	siginfo_t *info;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	if (notify_die(DIE_TRAP, "bounds", regs, error_code,
 			X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
 		goto exit;
@@ -457,6 +460,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	conditional_sti(regs);
 
 	if (v8086_mode(regs)) {
@@ -514,6 +518,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		return;
 
 	prev_state = ist_enter(regs);
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
 				SIGTRAP) == NOTIFY_STOP)
@@ -750,6 +755,7 @@ dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_MF);
 	exception_exit(prev_state);
 }
@@ -760,6 +766,7 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_XF);
 	exception_exit(prev_state);
 }
@@ -776,6 +783,7 @@ do_device_not_available(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	BUG_ON(use_eager_fpu());
 
 #ifdef CONFIG_MATH_EMULATION
@@ -805,6 +813,7 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	local_irq_enable();
 
 	info.si_signo = SIGILL;
-- 
2.4.3


* [PATCH v4 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

Changing the x86 context tracking hooks is dangerous because there
are no good checks that we track our context correctly.  Add a
helper to check that we're actually in CONTEXT_USER when we enter
from user mode and wire it up for syscall entries.

Subsequent patches will wire this up for all non-NMI entries as
well.  NMIs are their own special beast and cannot currently switch
overall context tracking state.  Instead, they have their own
special RCU hooks.

This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACKING (adds a layer
of indirection).  Eventually, we should fix up the core context
tracking code to supply a function that does what we want (and can
be much simpler than user_exit), which will enable us to get rid of
the extra call.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 348465473e55..8a7e35af7164 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,15 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
+#ifdef CONFIG_CONTEXT_TRACKING
+/* Called on entry from user mode with IRQs off. */
+__visible void enter_from_user_mode(void)
+{
+	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	user_exit();
+}
+#endif
+
 static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
 {
 #ifdef CONFIG_X86_64
@@ -65,14 +74,16 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	work = ACCESS_ONCE(current_thread_info()->flags) &
 		_TIF_WORK_SYSCALL_ENTRY;
 
+#ifdef CONFIG_CONTEXT_TRACKING
 	/*
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
 	if (work & _TIF_NOHZ) {
-		user_exit();
+		enter_from_user_mode();
 		work &= ~_TIF_NOHZ;
 	}
+#endif
 
 #ifdef CONFIG_SECCOMP
 	/*
-- 
2.4.3


* [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

The current entry and exit code is incomprehensible, appears to work
primarily by luck, and is very difficult to incrementally improve.  Add
new code in preparation for simply deleting the old code.

prepare_exit_to_usermode is a new function that will handle all slow
path exits to user mode.  It is called with IRQs disabled and it
leaves us in a state in which it is safe to immediately return to
user mode.  IRQs must not be re-enabled at any point after
prepare_exit_to_usermode returns and user mode is actually entered.
(We can, of course, fail to enter user mode and treat that failure
as a fresh entry to kernel mode.)  All callers of do_notify_resume
will be migrated to call prepare_exit_to_usermode instead;
prepare_exit_to_usermode needs to do everything that
do_notify_resume does, but it also takes care of scheduling and
context tracking.  Unlike do_notify_resume, it does not need to be
called in a loop.
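
As a sketch of the calling contract (example_return_to_user and
arch_do_iret are hypothetical; the real callers are asm paths wired up
in later patches):

	/* Minimal sketch, assuming a C-level exit path: */
	void example_return_to_user(struct pt_regs *regs)
	{
		local_irq_disable();
		prepare_exit_to_usermode(regs);
		/*
		 * From here until the actual return instruction, IRQs
		 * must stay off; re-entering the kernel now would count
		 * as a fresh kernel entry.
		 */
		arch_do_iret(regs);	/* hypothetical final return to user */
	}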

syscall_return_slowpath is exactly what it sounds like.  It will be
called on any syscall exit slow path.  It replaces
syscall_trace_leave and it calls prepare_exit_to_usermode on the way
out.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8a7e35af7164..55530d6dd1bd 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 		return syscall_trace_enter_phase2(regs, arch, phase1_result);
 }
 
+/* Deprecated. */
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
@@ -237,8 +238,117 @@ void syscall_trace_leave(struct pt_regs *regs)
 	user_enter();
 }
 
+static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
+{
+	unsigned long top_of_stack =
+		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
+	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
+}
+
+/* Called with IRQs disabled. */
+__visible void prepare_exit_to_usermode(struct pt_regs *regs)
+{
+	if (WARN_ON(!irqs_disabled()))
+		local_irq_disable();
+
+	/*
+	 * In order to return to user mode, we need to have IRQs off with
+	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
+	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
+	 * can be set at any time on preemptable kernels if we have IRQs on,
+	 * so we need to loop.  Disabling preemption wouldn't help: doing the
+	 * work to clear some of the flags can sleep.
+	 */
+	while (true) {
+		u32 cached_flags =
+			READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+		if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+				      _TIF_UPROBE | _TIF_NEED_RESCHED)))
+			break;
+
+		/* We have work to do. */
+		local_irq_enable();
+
+		if (cached_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (cached_flags & _TIF_UPROBE)
+			uprobe_notify_resume(regs);
+
+		/* deal with pending signal delivery */
+		if (cached_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (cached_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
+			fire_user_return_notifiers();
+
+		/* Disable IRQs and retry */
+		local_irq_disable();
+	}
+
+	user_enter();
+}
+
+/*
+ * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
+ * state such that we can immediately switch to user mode.
+ */
+__visible void syscall_return_slowpath(struct pt_regs *regs)
+{
+	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	u32 cached_flags = READ_ONCE(ti->flags);
+	bool step;
+
+	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
+
+	if (WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
+		 regs->orig_ax))
+		local_irq_enable();
+
+	/*
+	 * First do one-time work.  If these work items are enabled, we
+	 * want to run them exactly once per syscall exit with IRQs on.
+	 */
+	if (cached_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |
+			    _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)) {
+		audit_syscall_exit(regs);
+
+		if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+			trace_sys_exit(regs, regs->ax);
+
+		/*
+		 * If TIF_SYSCALL_EMU is set, we only get here because of
+		 * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
+		 * We already reported this syscall instruction in
+		 * syscall_trace_enter().
+		 */
+		step = unlikely(
+			(cached_flags & (_TIF_SINGLESTEP | _TIF_SYSCALL_EMU))
+			== _TIF_SINGLESTEP);
+		if (step || cached_flags & _TIF_SYSCALL_TRACE)
+			tracehook_report_syscall_exit(regs, step);
+	}
+
+#ifdef CONFIG_COMPAT
+	/*
+	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
+	 * returning to user mode.
+	 */
+	ti->status &= ~TS_COMPAT;
+#endif
+
+	local_irq_disable();
+	prepare_exit_to_usermode(regs);
+}
+
 /*
- * notification of userspace execution resumption
+ * Deprecated notification of userspace execution resumption
  * - triggered by the TIF_WORK_MASK flags
  */
 __visible void
-- 
2.4.3


* [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong --
IRET faults would skip the code path that was intended to execute on
all error entries from user mode.  Fix it up.

This does not fix a bug, but we'll need it, and it slightly shrinks
the code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 141a5d49dddc..cd9cbc62159c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1145,9 +1145,14 @@ ENTRY(error_entry)
 	testb	$3, CS+8(%rsp)
 	jz	error_kernelspace
 
-	/* We entered from user mode */
+error_entry_from_usermode_swapgs:
+	/*
+	 * We entered from user mode or we're pretending to have entered
+	 * from user mode due to an IRET fault.
+	 */
 	SWAPGS
 
+error_entry_from_usermode_after_swapgs:
 error_entry_done:
 	TRACE_IRQS_OFF
 	ret
@@ -1174,8 +1179,7 @@ error_kernelspace:
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * gs_change's error handler with kernel gsbase.
 	 */
-	SWAPGS
-	jmp	error_entry_done
+	jmp	error_entry_from_usermode_swapgs
 
 bstep_iret:
 	/* Fix truncated RIP */
@@ -1198,7 +1202,7 @@ error_bad_iret:
 	call	fixup_bad_iret
 	mov	%rax, %rsp
 	decl	%ebx
-	jmp	error_entry_done
+	jmp	error_entry_from_usermode_after_swapgs
 END(error_entry)
 
 
-- 
2.4.3


* [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

These need to be migrated together, as the compat case used to jump
into the middle of the 64-bit exit code.
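
Both paths now funnel into one C helper.  The shared tail looks
roughly like this (a sketch of the code added below; the compat audit
path reaches it via the new irqs-off label):

	GLOBAL(int_ret_from_sys_call_irqs_off)
		TRACE_IRQS_ON
		ENABLE_INTERRUPTS(CLBR_NONE)
		jmp	int_ret_from_sys_call

	GLOBAL(int_ret_from_sys_call)
		SAVE_EXTRA_REGS
		movq	%rsp, %rdi
		call	syscall_return_slowpath	/* returns with IRQs disabled */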

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        | 69 +++++-----------------------------------
 arch/x86/entry/entry_64_compat.S |  7 ++--
 2 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cd9cbc62159c..9bc76766aa71 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
 	 */
 	USERGS_SYSRET64
 
+GLOBAL(int_ret_from_sys_call_irqs_off)
+	TRACE_IRQS_ON
+	ENABLE_INTERRUPTS(CLBR_NONE)
+	jmp int_ret_from_sys_call
+
 	/* Do syscall entry tracing */
 tracesys:
 	movq	%rsp, %rdi
@@ -272,69 +277,11 @@ tracesys_phase2:
  * Has correct iret frame.
  */
 GLOBAL(int_ret_from_sys_call)
-	DISABLE_INTERRUPTS(CLBR_NONE)
-int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
-	TRACE_IRQS_OFF
-	movl	$_TIF_ALLWORK_MASK, %edi
-	/* edi:	mask to check */
-GLOBAL(int_with_check)
-	LOCKDEP_SYS_EXIT_IRQ
-	GET_THREAD_INFO(%rcx)
-	movl	TI_flags(%rcx), %edx
-	andl	%edi, %edx
-	jnz	int_careful
-	andl	$~TS_COMPAT, TI_status(%rcx)
-	jmp	syscall_return
-
-	/*
-	 * Either reschedule or signal or syscall exit tracking needed.
-	 * First do a reschedule test.
-	 * edx:	work, edi: workmask
-	 */
-int_careful:
-	bt	$TIF_NEED_RESCHED, %edx
-	jnc	int_very_careful
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq	%rdi
-	SCHEDULE_USER
-	popq	%rdi
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	int_with_check
-
-	/* handle signals and tracing -- both require a full pt_regs */
-int_very_careful:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
-	/* Check for syscall exit trace */
-	testl	$_TIF_WORK_SYSCALL_EXIT, %edx
-	jz	int_signal
-	pushq	%rdi
-	leaq	8(%rsp), %rdi			/* &ptregs -> arg1 */
-	call	syscall_trace_leave
-	popq	%rdi
-	andl	$~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
-	jmp	int_restore_rest
-
-int_signal:
-	testl	$_TIF_DO_NOTIFY_MASK, %edx
-	jz	1f
-	movq	%rsp, %rdi			/* &ptregs -> arg1 */
-	xorl	%esi, %esi			/* oldset -> arg2 */
-	call	do_notify_resume
-1:	movl	$_TIF_WORK_MASK, %edi
-int_restore_rest:
+	movq	%rsp, %rdi
+	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	RESTORE_EXTRA_REGS
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	int_with_check
-
-syscall_return:
-	/* The IRETQ could re-enable interrupts: */
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_IRETQ
+	TRACE_IRQS_IRETQ		/* we're about to change IF */
 
 	/*
 	 * Try to use SYSRET instead of IRET if we're returning to
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index efe0b1e499fa..ac0658142ae1 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -209,10 +209,10 @@ sysexit_from_sys_call:
 	.endm
 
 	.macro auditsys_exit exit
-	testl	$(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
-	jnz	ia32_ret_from_sys_call
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	jnz ia32_ret_from_sys_call
 	movl	%eax, %esi		/* second arg, syscall return value */
 	cmpl	$-MAX_ERRNO, %eax	/* is it an error ? */
 	jbe	1f
@@ -227,11 +227,10 @@ sysexit_from_sys_call:
 	testl	%edi, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jz	\exit
 	xorl	%eax, %eax		/* Do not leak kernel information */
-	movq	%rax, R11(%rsp)
 	movq	%rax, R10(%rsp)
 	movq	%rax, R9(%rsp)
 	movq	%rax, R8(%rsp)
-	jmp	int_with_check
+	jmp	int_ret_from_sys_call_irqs_off
 	.endm
 
 sysenter_auditsys:
-- 
2.4.3



* [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (10 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-07-02 10:52   ` Borislav Petkov
  2015-06-29 19:33 ` [PATCH v4 13/17] x86/asm/entry/64: Simplify irq stack pt_regs handling Andy Lutomirski
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

To prepare for the big rewrite of the error and interrupt exit
paths, we will need pt_regs completely filled in.  It's already
completely filled in when error_exit runs, so rearrange interrupt
handling to match it.  This will slow down interrupt handling very
slightly (eight instructions), but the simplification it enables
will be more than worth it.
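
The eight instructions are presumably the four extra stores of the
callee-saved r12..r15 on entry plus their four loads on exit (a
sketch, assuming the SAVE_EXTRA_REGS/RESTORE_EXTRA_REGS expansion in
calling.h; rbp and rbx were already saved before this patch):

	movq	%r12, R12(%rsp)		/* entry */
	movq	%r13, R13(%rsp)
	movq	%r14, R14(%rsp)
	movq	%r15, R15(%rsp)

	movq	R12(%rsp), %r12		/* exit */
	movq	R13(%rsp), %r13
	movq	R14(%rsp), %r14
	movq	R15(%rsp), %r15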

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9bc76766aa71..51d34fafe4c0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -502,21 +502,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
-	/*
-	 * Since nothing in interrupt handling code touches r12...r15 members
-	 * of "struct pt_regs", and since interrupts can nest, we can save
-	 * four stack slots and simultaneously provide
-	 * an unwind-friendly stack layout by saving "truncated" pt_regs
-	 * exactly up to rbp slot, without these members.
-	 */
-	ALLOC_PT_GPREGS_ON_STACK -RBP
-	SAVE_C_REGS -RBP
-	/* this goes to 0(%rsp) for unwinder, not for saving the value: */
-	SAVE_EXTRA_REGS_RBP -RBP
+	ALLOC_PT_GPREGS_ON_STACK
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
 
-	leaq	-RBP(%rsp), %rdi		/* arg1 for \func (pointer to pt_regs) */
+	movq	%rsp,%rdi	/* arg1 for \func (pointer to pt_regs) */
 
-	testb	$3, CS-RBP(%rsp)
+	testb	$3, CS(%rsp)
 	jz	1f
 	SWAPGS
 1:
@@ -553,9 +545,7 @@ ret_from_intr:
 	decl	PER_CPU_VAR(irq_count)
 
 	/* Restore saved previous stack */
-	popq	%rsi
-	/* return code expects complete pt_regs - adjust rsp accordingly: */
-	leaq	-RBP(%rsi), %rsp
+	popq	%rsp
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
@@ -580,7 +570,7 @@ retint_swapgs:					/* return to user-space */
 	TRACE_IRQS_IRETQ
 
 	SWAPGS
-	jmp	restore_c_regs_and_iret
+	jmp	restore_regs_and_iret
 
 /* Returning to kernel space */
 retint_kernel:
@@ -604,6 +594,8 @@ retint_kernel:
  * At this label, code paths which return to kernel and to user,
  * which come from interrupts/exception and from syscalls, merge.
  */
+restore_regs_and_iret:
+	RESTORE_EXTRA_REGS
 restore_c_regs_and_iret:
 	RESTORE_C_REGS
 	REMOVE_PT_GPREGS_FROM_STACK 8
@@ -674,12 +666,10 @@ retint_signal:
 	jz	retint_swapgs
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_EXTRA_REGS
 	movq	$-1, ORIG_RAX(%rsp)
 	xorl	%esi, %esi			/* oldset */
 	movq	%rsp, %rdi			/* &pt_regs */
 	call	do_notify_resume
-	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
@@ -1160,7 +1150,6 @@ END(error_entry)
  */
 ENTRY(error_exit)
 	movl	%ebx, %eax
-	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	testl	%eax, %eax
-- 
2.4.3



* [PATCH v4 13/17] x86/asm/entry/64: Simplify irq stack pt_regs handling
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (11 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-06-29 19:33 ` [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C Andy Lutomirski
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

There's no need for both rsi and rdi to point to the original stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 51d34fafe4c0..ce2c9049abef 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -506,8 +506,6 @@ END(irq_entries_start)
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 
-	movq	%rsp,%rdi	/* arg1 for \func (pointer to pt_regs) */
-
 	testb	$3, CS(%rsp)
 	jz	1f
 	SWAPGS
@@ -519,14 +517,14 @@ END(irq_entries_start)
 	 * a little cheaper to use a separate counter in the PDA (short of
 	 * moving irq_enter into assembly, which would be too much work)
 	 */
-	movq	%rsp, %rsi
+	movq	%rsp, %rdi
 	incl	PER_CPU_VAR(irq_count)
 	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rsi
+	pushq	%rdi
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
-	call	\func
+	call	\func	/* rdi points to pt_regs */
 	.endm
 
 	/*
-- 
2.4.3



* [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (12 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 13/17] x86/asm/entry/64: Simplify irq stack pt_regs handling Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-07-02 12:09   ` Borislav Petkov
  2015-06-29 19:33 ` [PATCH v4 15/17] x86/entry: Remove exception_enter from most trap handlers Andy Lutomirski
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        | 63 +++++++++++-----------------------------
 arch/x86/entry/entry_64_compat.S |  5 ++++
 2 files changed, 22 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce2c9049abef..08a37ec049f0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -508,7 +508,16 @@ END(irq_entries_start)
 
 	testb	$3, CS(%rsp)
 	jz	1f
+
+	/*
+	 * IRQ from user mode.  Switch to kernel gsbase and inform context
+	 * tracking that we're in kernel mode.
+	 */
 	SWAPGS
+#ifdef CONFIG_CONTEXT_TRACKING
+	call enter_from_user_mode
+#endif
+
 1:
 	/*
 	 * Save previous stack pointer, optionally switch to interrupt stack.
@@ -547,26 +556,13 @@ ret_from_intr:
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
-	/* Interrupt came from user space */
-GLOBAL(retint_user)
-	GET_THREAD_INFO(%rcx)
 
-	/* %rcx: thread info. Interrupts are off. */
-retint_with_reschedule:
-	movl	$_TIF_WORK_MASK, %edi
-retint_check:
+	/* Interrupt came from user space */
 	LOCKDEP_SYS_EXIT_IRQ
-	movl	TI_flags(%rcx), %edx
-	andl	%edi, %edx
-	jnz	retint_careful
-
-retint_swapgs:					/* return to user-space */
-	/*
-	 * The iretq could re-enable interrupts:
-	 */
-	DISABLE_INTERRUPTS(CLBR_ANY)
+GLOBAL(retint_user)
+	mov	%rsp,%rdi
+	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
-
 	SWAPGS
 	jmp	restore_regs_and_iret
 
@@ -644,35 +640,6 @@ native_irq_return_ldt:
 	popq	%rax
 	jmp	native_irq_return_iret
 #endif
-
-	/* edi: workmask, edx: work */
-retint_careful:
-	bt	$TIF_NEED_RESCHED, %edx
-	jnc	retint_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq	%rdi
-	SCHEDULE_USER
-	popq	%rdi
-	GET_THREAD_INFO(%rcx)
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp	retint_check
-
-retint_signal:
-	testl	$_TIF_DO_NOTIFY_MASK, %edx
-	jz	retint_swapgs
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	movq	$-1, ORIG_RAX(%rsp)
-	xorl	%esi, %esi			/* oldset */
-	movq	%rsp, %rdi			/* &pt_regs */
-	call	do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	GET_THREAD_INFO(%rcx)
-	jmp	retint_with_reschedule
-
 END(common_interrupt)
 
 /*
@@ -1088,6 +1055,10 @@ error_entry_from_usermode_swapgs:
 	SWAPGS
 
 error_entry_from_usermode_after_swapgs:
+#ifdef CONFIG_CONTEXT_TRACKING
+	call enter_from_user_mode
+#endif
+
 error_entry_done:
 	TRACE_IRQS_OFF
 	ret
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index ac0658142ae1..566d33f17d0b 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -454,6 +454,11 @@ ia32_badarg:
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 
+	/* Now finish entering normal kernel mode. */
+#ifdef CONFIG_CONTEXT_TRACKING
+	call enter_from_user_mode
+#endif
+
 	/* And exit again. */
 	jmp retint_user
 
-- 
2.4.3



* [PATCH v4 15/17] x86/entry: Remove exception_enter from most trap handlers
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (13 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-06-29 19:33 ` [PATCH v4 16/17] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h Andy Lutomirski
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

On 64-bit kernels, we don't need it any more: we handle context
tracking directly on entry from user mode and exit to user mode.  On
32-bit kernels, we don't support context tracking at all, so these
hooks have no effect.

This doesn't change do_page_fault.  Before we do that, we need to
make sure that there is no code that can page fault from kernel mode
with CONTEXT_USER.  The 32-bit fast system call stack argument code
is the only offender I'm aware of right now.
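
For a typical non-IST handler the transformation is mechanical; as a
sketch (do_coprocessor_error, matching the diff below):

	/* before */
	dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
	{
		enum ctx_state prev_state = exception_enter();

		CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
		math_error(regs, error_code, X86_TRAP_MF);
		exception_exit(prev_state);
	}

	/* after */
	dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
	{
		CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
		math_error(regs, error_code, X86_TRAP_MF);
	}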

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h         |  4 +-
 arch/x86/kernel/cpu/mcheck/mce.c     |  5 +--
 arch/x86/kernel/cpu/mcheck/p5.c      |  5 +--
 arch/x86/kernel/cpu/mcheck/winchip.c |  4 +-
 arch/x86/kernel/traps.c              | 78 +++++++++---------------------------
 5 files changed, 27 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c5380bea2a36..c3496619740a 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -112,8 +112,8 @@ asmlinkage void smp_threshold_interrupt(void);
 asmlinkage void smp_deferred_error_interrupt(void);
 #endif
 
-extern enum ctx_state ist_enter(struct pt_regs *regs);
-extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+extern void ist_enter(struct pt_regs *regs);
+extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index df919ff103c3..dc87973098dc 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1029,7 +1029,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 {
 	struct mca_config *cfg = &mca_cfg;
 	struct mce m, *final;
-	enum ctx_state prev_state;
 	int i;
 	int worst = 0;
 	int severity;
@@ -1055,7 +1054,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
-	prev_state = ist_enter(regs);
+	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);
 
@@ -1227,7 +1226,7 @@ out:
 	local_irq_disable();
 	ist_end_non_atomic();
 done:
-	ist_exit(regs, prev_state);
+	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
index 737b0ad4e61a..12402e10aeff 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mcheck/p5.c
@@ -19,10 +19,9 @@ int mce_p5_enabled __read_mostly;
 /* Machine check handler for Pentium class Intel CPUs: */
 static void pentium_machine_check(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
 	u32 loaddr, hi, lotype;
 
-	prev_state = ist_enter(regs);
+	ist_enter(regs);
 
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
 	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
@@ -39,7 +38,7 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs, prev_state);
+	ist_exit(regs);
 }
 
 /* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
index 44f138296fbe..01dd8702880b 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mcheck/winchip.c
@@ -15,12 +15,12 @@
 /* Machine check handler for WinChip C6: */
 static void winchip_machine_check(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state = ist_enter(regs);
+	ist_enter(regs);
 
 	printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs, prev_state);
+	ist_exit(regs);
 }
 
 /* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2a783c4fe0e9..8e65d8a9b8db 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -108,13 +108,10 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
 	preempt_count_dec();
 }
 
-enum ctx_state ist_enter(struct pt_regs *regs)
+void ist_enter(struct pt_regs *regs)
 {
-	enum ctx_state prev_state;
-
 	if (user_mode(regs)) {
-		/* Other than that, we're just an exception. */
-		prev_state = exception_enter();
+		CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	} else {
 		/*
 		 * We might have interrupted pretty much anything.  In
@@ -123,32 +120,25 @@ enum ctx_state ist_enter(struct pt_regs *regs)
 		 * but we need to notify RCU.
 		 */
 		rcu_nmi_enter();
-		prev_state = CONTEXT_KERNEL;  /* the value is irrelevant. */
 	}
 
 	/*
-	 * We are atomic because we're on the IST stack (or we're on x86_32,
-	 * in which case we still shouldn't schedule).
-	 *
-	 * This must be after exception_enter(), because exception_enter()
-	 * won't do anything if in_interrupt() returns true.
+	 * We are atomic because we're on the IST stack; or we're on
+	 * x86_32, in which case we still shouldn't schedule; or we're
+	 * on x86_64 and entered from user mode, in which case we're
+	 * still atomic unless ist_begin_non_atomic is called.
 	 */
 	preempt_count_add(HARDIRQ_OFFSET);
 
 	/* This code is a bit fragile.  Test it. */
 	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
-
-	return prev_state;
 }
 
-void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
+void ist_exit(struct pt_regs *regs)
 {
-	/* Must be before exception_exit. */
 	preempt_count_sub(HARDIRQ_OFFSET);
 
-	if (user_mode(regs))
-		return exception_exit(prev_state);
-	else
+	if (!user_mode(regs))
 		rcu_nmi_exit();
 }
 
@@ -162,7 +152,7 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
  * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
  * begins a non-atomic section within an ist_enter()/ist_exit() region.
  * Callers are responsible for enabling interrupts themselves inside
- * the non-atomic section, and callers must call is_end_non_atomic()
+ * the non-atomic section, and callers must call ist_end_non_atomic()
  * before ist_exit().
  */
 void ist_begin_non_atomic(struct pt_regs *regs)
@@ -289,7 +279,6 @@ NOKPROBE_SYMBOL(do_trap);
 static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
 			  unsigned long trapnr, int signr)
 {
-	enum ctx_state prev_state = exception_enter();
 	siginfo_t info;
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -300,8 +289,6 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
 		do_trap(trapnr, signr, str, regs, error_code,
 			fill_trap_info(regs, signr, trapnr, &info));
 	}
-
-	exception_exit(prev_state);
 }
 
 #define DO_ERROR(trapnr, signr, str, name)				\
@@ -353,7 +340,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	}
 #endif
 
-	ist_enter(regs);  /* Discard prev_state because we won't return. */
+	ist_enter(regs);
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
 	tsk->thread.error_code = error_code;
@@ -373,15 +360,13 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 
 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
 	const struct bndcsr *bndcsr;
 	siginfo_t *info;
 
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	if (notify_die(DIE_TRAP, "bounds", regs, error_code,
 			X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
-		goto exit;
+		return;
 	conditional_sti(regs);
 
 	if (!user_mode(regs))
@@ -438,9 +423,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 		die("bounds", regs, error_code);
 	}
 
-exit:
-	exception_exit(prev_state);
 	return;
+
 exit_trap:
 	/*
 	 * This path out is for all the cases where we could not
@@ -450,36 +434,33 @@ exit_trap:
 	 * time..
 	 */
 	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
-	exception_exit(prev_state);
 }
 
 dotraplinkage void
 do_general_protection(struct pt_regs *regs, long error_code)
 {
 	struct task_struct *tsk;
-	enum ctx_state prev_state;
 
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	conditional_sti(regs);
 
 	if (v8086_mode(regs)) {
 		local_irq_enable();
 		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
-		goto exit;
+		return;
 	}
 
 	tsk = current;
 	if (!user_mode(regs)) {
 		if (fixup_exception(regs))
-			goto exit;
+			return;
 
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
 		if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
 			       X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
 			die("general protection fault", regs, error_code);
-		goto exit;
+		return;
 	}
 
 	tsk->thread.error_code = error_code;
@@ -495,16 +476,12 @@ do_general_protection(struct pt_regs *regs, long error_code)
 	}
 
 	force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
-exit:
-	exception_exit(prev_state);
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
 /* May run on IST stack. */
 dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
-
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/*
 	 * ftrace must be first, everything else may cause a recursive crash.
@@ -517,7 +494,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	if (poke_int3_handler(regs))
 		return;
 
-	prev_state = ist_enter(regs);
+	ist_enter(regs);
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -544,7 +521,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
 exit:
-	ist_exit(regs, prev_state);
+	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_int3);
 
@@ -620,12 +597,11 @@ NOKPROBE_SYMBOL(fixup_bad_iret);
 dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 {
 	struct task_struct *tsk = current;
-	enum ctx_state prev_state;
 	int user_icebp = 0;
 	unsigned long dr6;
 	int si_code;
 
-	prev_state = ist_enter(regs);
+	ist_enter(regs);
 
 	get_debugreg(dr6, 6);
 
@@ -700,7 +676,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	ist_exit(regs, prev_state);
+	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
 
@@ -752,23 +728,15 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)
 
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
-
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_MF);
-	exception_exit(prev_state);
 }
 
 dotraplinkage void
 do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
-
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	math_error(regs, error_code, X86_TRAP_XF);
-	exception_exit(prev_state);
 }
 
 dotraplinkage void
@@ -780,9 +748,6 @@ do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 dotraplinkage void
 do_device_not_available(struct pt_regs *regs, long error_code)
 {
-	enum ctx_state prev_state;
-
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	BUG_ON(use_eager_fpu());
 
@@ -794,7 +759,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)
 
 		info.regs = regs;
 		math_emulate(&info);
-		exception_exit(prev_state);
 		return;
 	}
 #endif
@@ -802,7 +766,6 @@ do_device_not_available(struct pt_regs *regs, long error_code)
 #ifdef CONFIG_X86_32
 	conditional_sti(regs);
 #endif
-	exception_exit(prev_state);
 }
 NOKPROBE_SYMBOL(do_device_not_available);
 
@@ -810,9 +773,7 @@ NOKPROBE_SYMBOL(do_device_not_available);
 dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 {
 	siginfo_t info;
-	enum ctx_state prev_state;
 
-	prev_state = exception_enter();
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 	local_irq_enable();
 
@@ -825,7 +786,6 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 		do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs, error_code,
 			&info);
 	}
-	exception_exit(prev_state);
 }
 #endif
 
-- 
2.4.3



* [PATCH v4 16/17] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (14 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 15/17] x86/entry: Remove exception_enter from most trap handlers Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-06-29 19:33 ` [PATCH v4 17/17] x86/irq: Document how IRQ context tracking works and add an assertion Andy Lutomirski
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

SCHEDULE_USER is no longer used, and asm/context_tracking.h
contains nothing else.  Remove the header entirely.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S               |  1 -
 arch/x86/include/asm/context_tracking.h | 10 ----------
 2 files changed, 11 deletions(-)
 delete mode 100644 arch/x86/include/asm/context_tracking.h

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 08a37ec049f0..f8f71a567c71 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -33,7 +33,6 @@
 #include <asm/paravirt.h>
 #include <asm/percpu.h>
 #include <asm/asm.h>
-#include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
 #include <linux/err.h>
diff --git a/arch/x86/include/asm/context_tracking.h b/arch/x86/include/asm/context_tracking.h
deleted file mode 100644
index 1fe49704b146..000000000000
--- a/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
-- 
2.4.3



* [PATCH v4 17/17] x86/irq: Document how IRQ context tracking works and add an assertion
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (15 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 16/17] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h Andy Lutomirski
@ 2015-06-29 19:33 ` Andy Lutomirski
  2015-06-29 19:46 ` [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Richard Weinberger
  2015-07-02 16:45 ` Borislav Petkov
  18 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 19:33 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/irq.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 88b366487b0e..6233de046c08 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -216,8 +216,23 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	unsigned vector = ~regs->orig_ax;
 	unsigned irq;
 
+	/*
+	 * NB: Unlike exception entries, IRQ entries do not reliably
+	 * handle context tracking in the low-level entry code.  This is
+	 * because syscall entries execute briefly with IRQs on before
+	 * updating context tracking state, so we can take an IRQ from
+	 * kernel mode with CONTEXT_USER.  The low-level entry code only
+	 * updates the context if we came from user mode, so we won't
+	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
+	 * code is cleaned up enough that we can cleanly defer enabling
+	 * IRQs.
+	 */
+
 	entering_irq();
 
+	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
+	rcu_lockdep_assert(rcu_is_watching(), "IRQ failed to wake up RCU");
+
 	irq = __this_cpu_read(vector_irq[vector]);
 
 	if (!handle_irq(irq, regs)) {
-- 
2.4.3



* Re: [PATCH v4 00/17] x86: Rewrite exit-to-userspace code
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (16 preceding siblings ...)
  2015-06-29 19:33 ` [PATCH v4 17/17] x86/irq: Document how IRQ context tracking works and add an assertion Andy Lutomirski
@ 2015-06-29 19:46 ` Richard Weinberger
  2015-06-29 20:14   ` Andy Lutomirski
  2015-07-02 16:45 ` Borislav Petkov
  18 siblings, 1 reply; 54+ messages in thread
From: Richard Weinberger @ 2015-06-29 19:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook,
	Brian Gerst, Paul McKenney

Andy,

On Mon, Jun 29, 2015 at 9:33 PM, Andy Lutomirski <luto@kernel.org> wrote:
> This is the first big batch of x86 asm-to-C conversion patches.

Do you have a git tree for that?

-- 
Thanks,
//richard


* Re: [PATCH v4 00/17] x86: Rewrite exit-to-userspace code
  2015-06-29 19:46 ` [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Richard Weinberger
@ 2015-06-29 20:14   ` Andy Lutomirski
  0 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-29 20:14 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Andy Lutomirski, x86, LKML, Frédéric Weisbecker,
	Rik van Riel, Oleg Nesterov, Denys Vlasenko, Borislav Petkov,
	Kees Cook, Brian Gerst, Paul McKenney

On Mon, Jun 29, 2015 at 12:46 PM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> Andy,
>
> On Mon, Jun 29, 2015 at 9:33 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> This is the first big batch of x86 asm-to-C conversion patches.
>>
>
> Do you have a git tree for that?

I just tagged it here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=patch-20150629-entry-v4

--Andy


* Re: [PATCH v4 03/17] uml: Fix do_signal() prototype
  2015-06-29 19:33 ` [PATCH v4 03/17] uml: Fix do_signal() prototype Andy Lutomirski
@ 2015-06-29 20:47   ` Richard Weinberger
  0 siblings, 0 replies; 54+ messages in thread
From: Richard Weinberger @ 2015-06-29 20:47 UTC (permalink / raw)
  To: Andy Lutomirski, x86, linux-kernel
  Cc: Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst, paulmck,
	Ingo Molnar, Richard Weinberger, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner

Am 29.06.2015 um 21:33 schrieb Andy Lutomirski:
> From: Ingo Molnar <mingo@kernel.org>
> 
> Once x86 exports its do_signal(), the prototypes will clash.
> 
> Fix the clash and also improve the code a bit: remove the unnecessary
> kern_do_signal() indirection. This allows interrupt_end() to share
> the 'regs' parameter calculation.
> 
> Also remove the unused return code to match x86.
> 
> Minimally build and boot tested.
> 
> Cc: Richard Weinberger <richard.weinberger@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> [Adjusted the commit message because I reordered the patch. --Andy]
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Richard Weinberger <richard@nod.at>

Thanks,
//richard


* Re: [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path
  2015-06-29 19:33 ` [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path Andy Lutomirski
@ 2015-06-30 10:58   ` Borislav Petkov
  2015-06-30 11:06     ` Ingo Molnar
  2015-06-30 16:04     ` Andy Lutomirski
  0 siblings, 2 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-06-30 10:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:34PM -0700, Andy Lutomirski wrote:
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index bb187a6a877c..efe0b1e499fa 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -425,8 +425,39 @@ cstar_tracesys:
>  END(entry_SYSCALL_compat)
>  
>  ia32_badarg:
> -	ASM_CLAC
> -	movq	$-EFAULT, RAX(%rsp)
> +	/*
> +	 * So far, we've entered kernel mode, set AC, turned on IRQs, and
> +	 * saved C regs except r8-r11.  We haven't done any of the other
> +	 * standard entry work, though.  We want to bail, but we shouldn't
> +	 * treat this as a syscall entry since we don't even know what the
> +	 * args are.  Instead, treat this as a non-syscall entry, finish
> +	 * the entry work, and immediately exit after setting AX = -EFAULT.
> +	 *
> +	 * We're really just being polite here.  Killing the task outright
> +	 * would be a reasonable action, too.  Given that the only valid
> +	 * way to have gotten here is through the vDSO, and we already know
> +	 * that the stack pointer is bad, the task isn't going to survive
> +	 * for long no matter what we do.

You mean something like

	force_sig_info(SIGSEGV, &si, current);

?

I'd say we do it and not noodle unnecessarily with zeroing out pt_regs
if the task is going to die anyway. IOW, make it die faster. :)
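
Something like this, I mean (an untested sketch; it assumes we're in C
code with pt_regs in good enough shape for signal delivery):

	siginfo_t si;

	memset(&si, 0, sizeof(si));
	si.si_signo = SIGSEGV;
	si.si_code = SI_KERNEL;
	force_sig_info(SIGSEGV, &si, current);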

> +	 */
> +
> +	ASM_CLAC			/* undo STAC */
> +	movq	$-EFAULT, RAX(%rsp)	/* return -EFAULT if possible */
> +
> +	/* Fill in the rest of pt_regs */
> +	xorl	%eax, %eax
> +	movq	%rax, R11(%rsp)
> +	movq	%rax, R10(%rsp)
> +	movq	%rax, R9(%rsp)
> +	movq	%rax, R8(%rsp)
> +	SAVE_EXTRA_REGS
> +
> +	/* Turn IRQs back off. */
> +	DISABLE_INTERRUPTS(CLBR_NONE)
> +	TRACE_IRQS_OFF
> +
> +	/* And exit again. */
> +	jmp retint_user
> +
>  ia32_ret_from_sys_call:
>  	xorl	%eax, %eax		/* Do not leak kernel information */
>  	movq	%rax, R11(%rsp)
> -- 
> 2.4.3

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path
  2015-06-30 10:58   ` Borislav Petkov
@ 2015-06-30 11:06     ` Ingo Molnar
  2015-06-30 16:04     ` Andy Lutomirski
  1 sibling, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-06-30 11:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, paulmck


* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Jun 29, 2015 at 12:33:34PM -0700, Andy Lutomirski wrote:
> > diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> > index bb187a6a877c..efe0b1e499fa 100644
> > --- a/arch/x86/entry/entry_64_compat.S
> > +++ b/arch/x86/entry/entry_64_compat.S
> > @@ -425,8 +425,39 @@ cstar_tracesys:
> >  END(entry_SYSCALL_compat)
> >  
> >  ia32_badarg:
> > -	ASM_CLAC
> > -	movq	$-EFAULT, RAX(%rsp)
> > +	/*
> > +	 * So far, we've entered kernel mode, set AC, turned on IRQs, and
> > +	 * saved C regs except r8-r11.  We haven't done any of the other
> > +	 * standard entry work, though.  We want to bail, but we shouldn't
> > +	 * treat this as a syscall entry since we don't even know what the
> > +	 * args are.  Instead, treat this as a non-syscall entry, finish
> > +	 * the entry work, and immediately exit after setting AX = -EFAULT.
> > +	 *
> > +	 * We're really just being polite here.  Killing the task outright
> > +	 * would be a reasonable action, too.  Given that the only valid
> > +	 * way to have gotten here is through the vDSO, and we already know
> > +	 * that the stack pointer is bad, the task isn't going to survive
> > +	 * for long no matter what we do.
> 
> You mean something like
> 
> 	force_sig_info(SIGSEGV, &si, current);
> 
> ?

We should also emit a warning message, even if user space installed a 'special'
sigfault handler to hide such failures. (I'm looking at you, systemd!)
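
Something like this, perhaps (a sketch; the exact message text is made
up):

	pr_warn_ratelimited("%s[%d]: bad fast syscall args, sending SIGSEGV\n",
			    current->comm, task_pid_nr(current));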

Thanks,

	Ingo


* Re: [PATCH v4 04/17] context_tracking: Add ct_state and CT_WARN_ON
  2015-06-29 19:33 ` [PATCH v4 04/17] context_tracking: Add ct_state and CT_WARN_ON Andy Lutomirski
@ 2015-06-30 12:20   ` Borislav Petkov
  2015-06-30 12:53     ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-06-30 12:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:36PM -0700, Andy Lutomirski wrote:
> This will let us sprinkle sanity checks around the kernel without
> making too much of a mess.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/context_tracking.h       | 15 +++++++++++++++
>  include/linux/context_tracking_state.h |  1 +
>  2 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index b96bd299966f..008fc67d0d96 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -49,13 +49,28 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  	}
>  }
>  
> +
> +/**
> + * ct_state() - return the current context tracking state if known
> + *
> + * Returns the current cpu's context tracking state if context tracking

			 CPU's

> + * is enabled.  If context tracking is disabled, returns
> + * CONTEXT_DISABLED.  This should be used primarily for debugging.
> + */
> +static inline enum ctx_state ct_state(void)
> +{
> +	return context_tracking_is_enabled() ?
> +		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
> +}
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline enum ctx_state exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
> +static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
>  #endif /* !CONFIG_CONTEXT_TRACKING */
>  
> +#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))

Btw, that "CT_" prefix makes this look like it is the software
controlling this thing:

https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/UPMCEast_CTscan.jpg/1280px-UPMCEast_CTscan.jpg

:-)

Other kernel code uses "cxt" or "ctxt" abbreviations. Maybe

	CTXT_WARN_ON(cond)

...

>  #ifdef CONFIG_CONTEXT_TRACKING_FORCE
>  extern void context_tracking_init(void);
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index 678ecdf90cf6..ee956c528fab 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -14,6 +14,7 @@ struct context_tracking {
>  	bool active;
>  	int recursion;
>  	enum ctx_state {
> +		CONTEXT_DISABLED = -1,	/* returned by ct_state() if unknown */
>  		CONTEXT_KERNEL = 0,
>  		CONTEXT_USER,
>  		CONTEXT_GUEST,

And those then CTXT_*

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH v4 04/17] context_tracking: Add ct_state and CT_WARN_ON
  2015-06-30 12:20   ` Borislav Petkov
@ 2015-06-30 12:53     ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-06-30 12:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, paulmck


* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Jun 29, 2015 at 12:33:36PM -0700, Andy Lutomirski wrote:
> > This will let us sprinkle sanity checks around the kernel without
> > making too much of a mess.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  include/linux/context_tracking.h       | 15 +++++++++++++++
> >  include/linux/context_tracking_state.h |  1 +
> >  2 files changed, 16 insertions(+)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index b96bd299966f..008fc67d0d96 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -49,13 +49,28 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> >  	}
> >  }
> >  
> > +
> > +/**
> > + * ct_state() - return the current context tracking state if known
> > + *
> > + * Returns the current cpu's context tracking state if context tracking
> 
> 			 CPU's
> 
> > + * is enabled.  If context tracking is disabled, returns
> > + * CONTEXT_DISABLED.  This should be used primarily for debugging.
> > + */
> > +static inline enum ctx_state ct_state(void)
> > +{
> > +	return context_tracking_is_enabled() ?
> > +		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
> > +}
> >  #else
> >  static inline void user_enter(void) { }
> >  static inline void user_exit(void) { }
> >  static inline enum ctx_state exception_enter(void) { return 0; }
> >  static inline void exception_exit(enum ctx_state prev_ctx) { }
> > +static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
> >  #endif /* !CONFIG_CONTEXT_TRACKING */
> >  
> > +#define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))
> 
> Btw, that "CT_" prefix makes this look like it is the software
> controlling this thing:
> 
> https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/UPMCEast_CTscan.jpg/1280px-UPMCEast_CTscan.jpg
> 
> :-)
> 
> Other kernel code uses "cxt" or "ctxt" abbreviations. Maybe
> 
> 	CTXT_WARN_ON(cond)

So this really looks too much like 'TXT' - which is very far from what it tries to 
express.

So we should either use a CTX_TRACK_ prefix that is unambiguous, or live with CT_ 
and its ambiguity. (There's another ambiguity it has: 'ct' stands for 'connection 
tracking' in the networking code.)

Thanks,

	Ingo


* Re: [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path
  2015-06-30 10:58   ` Borislav Petkov
  2015-06-30 11:06     ` Ingo Molnar
@ 2015-06-30 16:04     ` Andy Lutomirski
  2015-07-01  7:43       ` Ingo Molnar
  1 sibling, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-30 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kees Cook, Paul E. McKenney, linux-kernel, Oleg Nesterov,
	Denys Vlasenko, Brian Gerst, Frédéric Weisbecker,
	X86 ML, Rik van Riel

On Jun 30, 2015 3:59 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Mon, Jun 29, 2015 at 12:33:34PM -0700, Andy Lutomirski wrote:
> > diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> > index bb187a6a877c..efe0b1e499fa 100644
> > --- a/arch/x86/entry/entry_64_compat.S
> > +++ b/arch/x86/entry/entry_64_compat.S
> > @@ -425,8 +425,39 @@ cstar_tracesys:
> >  END(entry_SYSCALL_compat)
> >
> >  ia32_badarg:
> > -     ASM_CLAC
> > -     movq    $-EFAULT, RAX(%rsp)
> > +     /*
> > +      * So far, we've entered kernel mode, set AC, turned on IRQs, and
> > +      * saved C regs except r8-r11.  We haven't done any of the other
> > +      * standard entry work, though.  We want to bail, but we shouldn't
> > +      * treat this as a syscall entry since we don't even know what the
> > +      * args are.  Instead, treat this as a non-syscall entry, finish
> > +      * the entry work, and immediately exit after setting AX = -EFAULT.
> > +      *
> > +      * We're really just being polite here.  Killing the task outright
> > +      * would be a reasonable action, too.  Given that the only valid
> > +      * way to have gotten here is through the vDSO, and we already know
> > +      * that the stack pointer is bad, the task isn't going to survive
> > +      * for long no matter what we do.
>
> You mean something like
>
>         force_sig_info(SIGSEGV, &si, current);
>
> ?
>
> I'd say we do it and not noodle unnecessarily with zeroing out pt_regs
> if the task is going to die anyway. IOW, make it die faster. :)

That's even more complicated.  To send a signal, we need valid pt_regs
when we save our context in do_signal.  I was thinking do_exit.

In any case, we can certainly change the behavior here, but I did it
this way at the beginning of the series for two reasons: this patch
mostly preserves existing behavior while fixing what is arguably a
bug, and without something like this we'll have context tracking
issues later in the series.

My eventual goal is to try to turn entry_SYSENTER_32 into a minimal
piece of asm that pushes regs and calls a C function like
do_SYSENTER_32.  Once that happens, we can just write:

enter_from_user_mode();

local_irq_enable();

if (__get_user(...) != 0) {
  fail();
}

pt_regs->whatever = the thing we just read;

... and do the syscall

Given the current state of the asm, I think I'd rather get farther
along in the cleanups before trying to change behavior here.

--Andy



* Re: [PATCH v4 06/17] x86: Move C entry and exit code to arch/x86/entry/common.c
  2015-06-29 19:33 ` [PATCH v4 06/17] x86: Move C entry and exit code to arch/x86/entry/common.c Andy Lutomirski
@ 2015-06-30 16:32   ` Borislav Petkov
  0 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-06-30 16:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:38PM -0700, Andy Lutomirski wrote:
> The entry and exit C helpers were confusingly scattered between
> ptrace.c and signal.c, even though they aren't specific to ptrace or
> signal handling.  Move them together in a new file.
> 
> This change just moves code around.  It doesn't change anything.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/Makefile       |   1 +
>  arch/x86/entry/common.c       | 253 ++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/signal.h |   1 +
>  arch/x86/kernel/ptrace.c      | 202 +--------------------------------
>  arch/x86/kernel/signal.c      |  28 +----
>  5 files changed, 257 insertions(+), 228 deletions(-)
>  create mode 100644 arch/x86/entry/common.c
> 
> diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
> index 7a144971db79..bd55dedd7614 100644
> --- a/arch/x86/entry/Makefile
> +++ b/arch/x86/entry/Makefile
> @@ -2,6 +2,7 @@
>  # Makefile for the x86 low level entry code
>  #
>  obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
> +obj-y				+= common.o
>  
>  obj-y				+= vdso/
>  obj-y				+= vsyscall/
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> new file mode 100644
> index 000000000000..348465473e55
> --- /dev/null
> +++ b/arch/x86/entry/common.c
> @@ -0,0 +1,253 @@
> +/*
> + * entry.c - C code for kernel entry and exit

common.c

> + * Copyright (c) 2015 Andrew Lutomirski
> + * GPL v2
> + *
> + * Based on asm and ptrace code by many authors.  The code here originated
> + * in ptrace.c and signal.c.
> + */

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
  2015-06-29 19:33 ` [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries Andy Lutomirski
@ 2015-06-30 17:01   ` Borislav Petkov
  2015-06-30 17:08     ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-06-30 17:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:39PM -0700, Andy Lutomirski wrote:
> Other than the super-atomic exception entries, all exception entries
> are supposed to switch our context tracking state to CONTEXT_KERNEL.
> Assert that they do.  These assertions appear trivial at this point,
> as exception_enter is the function responsible for switching
> context, but I'm planning on reworking x86's exception context
> tracking, and these assertions will help make sure that all of this
> code keeps working.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/traps.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index f5791927aa64..2a783c4fe0e9 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,6 +292,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
>  	enum ctx_state prev_state = exception_enter();
>  	siginfo_t info;
>  
> +	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
> +
>  	if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
>  			NOTIFY_STOP) {
>  		conditional_sti(regs);
> @@ -376,6 +378,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>  	siginfo_t *info;
>  
>  	prev_state = exception_enter();
> +	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>  	if (notify_die(DIE_TRAP, "bounds", regs, error_code,
>  			X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
>  		goto exit;
> @@ -457,6 +460,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
>  	enum ctx_state prev_state;
>  
>  	prev_state = exception_enter();
> +	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>  	conditional_sti(regs);
>  
>  	if (v8086_mode(regs)) {
> @@ -514,6 +518,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>  		return;
>  
>  	prev_state = ist_enter(regs);
> +	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

Yeah, so any chance those assertions can be moved to the end of both
ist_enter() and exception_enter()?

Yeah, I read above that you're planning to rework it but it is cleaner
to have them at the end of the _enter() functions instead of in all those
trap handlers, no...?
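
Something like this, I mean (a sketch only; exception_enter_checked() is
a made-up name, and it assumes the ct_state()/CT_WARN_ON helpers from
patch 04):

static inline enum ctx_state exception_enter_checked(void)
{
	enum ctx_state prev_state = exception_enter();

	/* every non-IST entry should have switched us to CONTEXT_KERNEL */
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);

	return prev_state;
}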

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
  2015-06-30 17:01   ` Borislav Petkov
@ 2015-06-30 17:08     ` Andy Lutomirski
  2015-06-30 17:15       ` Borislav Petkov
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-06-30 17:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Tue, Jun 30, 2015 at 10:01 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:39PM -0700, Andy Lutomirski wrote:
>> Other than the super-atomic exception entries, all exception entries
>> are supposed to switch our context tracking state to CONTEXT_KERNEL.
>> Assert that they do.  These assertions appear trivial at this point,
>> as exception_enter is the function responsible for switching
>> context, but I'm planning on reworking x86's exception context
>> tracking, and these assertions will help make sure that all of this
>> code keeps working.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/kernel/traps.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index f5791927aa64..2a783c4fe0e9 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -292,6 +292,8 @@ static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
>>       enum ctx_state prev_state = exception_enter();
>>       siginfo_t info;
>>
>> +     CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>> +
>>       if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
>>                       NOTIFY_STOP) {
>>               conditional_sti(regs);
>> @@ -376,6 +378,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>>       siginfo_t *info;
>>
>>       prev_state = exception_enter();
>> +     CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>>       if (notify_die(DIE_TRAP, "bounds", regs, error_code,
>>                       X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
>>               goto exit;
>> @@ -457,6 +460,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
>>       enum ctx_state prev_state;
>>
>>       prev_state = exception_enter();
>> +     CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>>       conditional_sti(regs);
>>
>>       if (v8086_mode(regs)) {
>> @@ -514,6 +518,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>>               return;
>>
>>       prev_state = ist_enter(regs);
>> +     CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
>
> Yeah, so any chance those assertions can be moved to the end of both
> ist_enter() and exception_enter()?
>
> Yeah, I read above that you're planning to rework it but it is cleaner
> to have them at the end of the _enter() functions instead of in all those
> trap handlers, no...?
>

I would agree, except that I remove the exception_enter calls later in
the series, so that wouldn't work.  Maybe we should move them into
common code outside the specific exception handlers (idtentry?) when
the dust settles.
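
For the C side, something as small as this would do (a sketch;
assert_ctx_kernel() is a made-up name):

/* called from the idtentry macro once the entry switch is done */
__visible void assert_ctx_kernel(void)
{
	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
}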

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries
  2015-06-30 17:08     ` Andy Lutomirski
@ 2015-06-30 17:15       ` Borislav Petkov
  0 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-06-30 17:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Tue, Jun 30, 2015 at 10:08:00AM -0700, Andy Lutomirski wrote:
> I would agree, except that I remove the exception_enter calls later in
> the series, so that wouldn't work.

Ah.

> Maybe we should move them into common code outside the specific
> exception handlers (idtentry?) when the dust settles.

Yeah, that macro is huge but it could survive yet another CALL <func>.

:-)

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path
  2015-06-30 16:04     ` Andy Lutomirski
@ 2015-07-01  7:43       ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-07-01  7:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Kees Cook, Paul E. McKenney, linux-kernel,
	Oleg Nesterov, Denys Vlasenko, Brian Gerst,
	Frédéric Weisbecker, X86 ML, Rik van Riel


* Andy Lutomirski <luto@amacapital.net> wrote:

> Given the current state of the asm, I think I'd rather get farther along in the 
> cleanups before trying to change behavior here.

Agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls
  2015-06-29 19:33 ` [PATCH v4 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls Andy Lutomirski
@ 2015-07-01 10:24   ` Borislav Petkov
  0 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-07-01 10:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:40PM -0700, Andy Lutomirski wrote:
> Changing the x86 context tracking hooks is dangerous because there
> are no good checks that we track our context correctly.  Add a
> helper to check that we're actually in CONTEXT_USER when we enter
> from user mode and wire it up for syscall entries.
> 
> Subsequent patches will wire this up for all non-NMI entries as
> well.  NMIs are their own special beast and cannot currently switch
> overall context tracking state.  Instead, they have their own
> special RCU hooks.
> 
> This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
> branch) and a tiny slowdown if CONFIG_CONTEXT_TRACING (adds a layer

CONTEXT_TRACING?!

Oh noooo, not another tracer :-P

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks
  2015-06-29 19:33 ` [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks Andy Lutomirski
@ 2015-07-02  9:48   ` Borislav Petkov
  2015-07-02 16:03     ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02  9:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:41PM -0700, Andy Lutomirski wrote:
> The current entry and exit code is incomprehensible, appears to work
> primarily by luck, and is very difficult to incrementally improve.  Add
> new code in preparation for simply deleting the old code.
> 
> prepare_exit_to_usermode is a new function that will handle all slow
> path exits to user mode.  It is called with IRQs disabled and it
> leaves us in a state in which it is safe to immediately return to
> user mode.  IRQs must not be re-enabled at any point after
> prepare_exit_to_usermode returns and user mode is actually entered.
> (We can, of course, fail to enter user mode and treat that failure
> as a fresh entry to kernel mode.)  All callers of do_notify_resume
> will be migrated to call prepare_exit_to_usermode instead;
> prepare_exit_to_usermode needs to do everything that
> do_notify_resume does, but it also takes care of scheduling and
> context tracking.  Unlike do_notify_resume, it does not need to be
> called in a loop.
> 
> syscall_return_slowpath is exactly what it sounds like.  It will be
> called on any syscall exit slow path.  It will replace
> syscall_trace_leave and it calls prepare_exit_to_usermode on the way
> out.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 111 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 8a7e35af7164..55530d6dd1bd 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
>  		return syscall_trace_enter_phase2(regs, arch, phase1_result);
>  }
>  
> +/* Deprecated. */
>  void syscall_trace_leave(struct pt_regs *regs)

Ah yes, this will get replaced later with syscall_return_slowpath below.

>  {
>  	bool step;
> @@ -237,8 +238,117 @@ void syscall_trace_leave(struct pt_regs *regs)
>  	user_enter();
>  }
>  
> +static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
> +{
> +	unsigned long top_of_stack =
> +		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
> +	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
> +}
> +
> +/* Called with IRQs disabled. */
> +__visible void prepare_exit_to_usermode(struct pt_regs *regs)
> +{
> +	if (WARN_ON(!irqs_disabled()))
> +		local_irq_disable();
> +
> +	/*
> +	 * In order to return to user mode, we need to have IRQs off with
> +	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
> +	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
> +	 * can be set at any time on preemptable kernels if we have IRQs on,
> +	 * so we need to loop.  Disabling preemption wouldn't help: doing the
> +	 * work to clear some of the flags can sleep.
> +	 */
> +	while (true) {
> +		u32 cached_flags =
> +			READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> +
> +		if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> +				      _TIF_UPROBE | _TIF_NEED_RESCHED)))
> +			break;
> +
> +		/* We have work to do. */
> +		local_irq_enable();
> +
> +		if (cached_flags & _TIF_NEED_RESCHED)
> +			schedule();
> +
> +		if (cached_flags & _TIF_UPROBE)
> +			uprobe_notify_resume(regs);
> +
> +		/* deal with pending signal delivery */
> +		if (cached_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> +		if (cached_flags & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(regs);
> +		}
> +
> +		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
> +			fire_user_return_notifiers();
> +
> +		/* Disable IRQs and retry */
> +		local_irq_disable();
> +	}

Stupid question: what assures us that we'll break out of this loop
at some point? I.e., isn't a scenario possible where something keeps
setting bits in ->flags while we're handling stuff in the IRQs-on
section?

OTOH, this is what int_ret_from_sys_call() does now anyway, so we should
be fine.

Yeah, it looks that way.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path
  2015-06-29 19:33 ` [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path Andy Lutomirski
@ 2015-07-02 10:25   ` Borislav Petkov
  2015-07-02 15:33     ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 10:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:42PM -0700, Andy Lutomirski wrote:
> In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
> gsbase/ebx/usermode code"), I arranged the code slightly wrong --
> IRET faults would skip the code path that was intended to execute on
> all error entries from user mode.  Fix it up.
> 
> This does not fix a bug, but we'll need it, and it slightly shrinks
> the code.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 141a5d49dddc..cd9cbc62159c 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1145,9 +1145,14 @@ ENTRY(error_entry)
>  	testb	$3, CS+8(%rsp)
>  	jz	error_kernelspace
>  
> -	/* We entered from user mode */
> +error_entry_from_usermode_swapgs:
> +	/*
> +	 * We entered from user mode or we're pretending to have entered
> +	 * from user mode due to an IRET fault.
> +	 */
>  	SWAPGS
>  
> +error_entry_from_usermode_after_swapgs:
>  error_entry_done:

Why the second label?

Also, please make all those labels local by prefixing them with .L
because they appear unnecessarily in objdump output as global symbols.

$ objdump -d vmlinux | grep -E "^[0-9a-f]+ <(error_entry_\w*|error_kernelspace|bstep_iret|error_bad_iret)"
ffffffff8167a4a8 <error_entry_from_usermode_swapgs>:
ffffffff8167a4ab <error_entry_done>:
ffffffff8167a4b1 <error_kernelspace>:
ffffffff8167a4e0 <bstep_iret>:
ffffffff8167a4e8 <error_bad_iret>:

Oh, and shorter please :)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
  2015-06-29 19:33 ` [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks Andy Lutomirski
@ 2015-07-02 10:49   ` Borislav Petkov
  2015-07-02 15:56     ` Andy Lutomirski
  2015-07-02 16:56   ` Denys Vlasenko
  1 sibling, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 10:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:43PM -0700, Andy Lutomirski wrote:
> These need to be migrated together, as the compat case used to jump
> into the middle of the 64-bit exit code.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S        | 69 +++++-----------------------------------
>  arch/x86/entry/entry_64_compat.S |  7 ++--
>  2 files changed, 11 insertions(+), 65 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cd9cbc62159c..9bc76766aa71 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -229,6 +229,11 @@ entry_SYSCALL_64_fastpath:
>  	 */
>  	USERGS_SYSRET64
>  
> +GLOBAL(int_ret_from_sys_call_irqs_off)
> +	TRACE_IRQS_ON
> +	ENABLE_INTERRUPTS(CLBR_NONE)
> +	jmp int_ret_from_sys_call
> +
>  	/* Do syscall entry tracing */
>  tracesys:
>  	movq	%rsp, %rdi
> @@ -272,69 +277,11 @@ tracesys_phase2:
>   * Has correct iret frame.
>   */
>  GLOBAL(int_ret_from_sys_call)
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -int_ret_from_sys_call_irqs_off: /* jumps come here from the irqs-off SYSRET path */
> -	TRACE_IRQS_OFF
> -	movl	$_TIF_ALLWORK_MASK, %edi
> -	/* edi:	mask to check */
> -GLOBAL(int_with_check)
> -	LOCKDEP_SYS_EXIT_IRQ
> -	GET_THREAD_INFO(%rcx)
> -	movl	TI_flags(%rcx), %edx
> -	andl	%edi, %edx
> -	jnz	int_careful
> -	andl	$~TS_COMPAT, TI_status(%rcx)
> -	jmp	syscall_return
> -
> -	/*
> -	 * Either reschedule or signal or syscall exit tracking needed.
> -	 * First do a reschedule test.
> -	 * edx:	work, edi: workmask
> -	 */
> -int_careful:
> -	bt	$TIF_NEED_RESCHED, %edx
> -	jnc	int_very_careful
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
> -	pushq	%rdi
> -	SCHEDULE_USER
> -	popq	%rdi
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -	TRACE_IRQS_OFF
> -	jmp	int_with_check
> -
> -	/* handle signals and tracing -- both require a full pt_regs */
> -int_very_careful:
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
>  	SAVE_EXTRA_REGS
> -	/* Check for syscall exit trace */
> -	testl	$_TIF_WORK_SYSCALL_EXIT, %edx
> -	jz	int_signal
> -	pushq	%rdi
> -	leaq	8(%rsp), %rdi			/* &ptregs -> arg1 */
> -	call	syscall_trace_leave
> -	popq	%rdi
> -	andl	$~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU), %edi
> -	jmp	int_restore_rest
> -
> -int_signal:
> -	testl	$_TIF_DO_NOTIFY_MASK, %edx
> -	jz	1f
> -	movq	%rsp, %rdi			/* &ptregs -> arg1 */
> -	xorl	%esi, %esi			/* oldset -> arg2 */
> -	call	do_notify_resume
> -1:	movl	$_TIF_WORK_MASK, %edi
> -int_restore_rest:
> +	movq	%rsp, %rdi
> +	call	syscall_return_slowpath	/* returns with IRQs disabled */
>  	RESTORE_EXTRA_REGS
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -	TRACE_IRQS_OFF
> -	jmp	int_with_check
> -
> -syscall_return:
> -	/* The IRETQ could re-enable interrupts: */
> -	DISABLE_INTERRUPTS(CLBR_ANY)
> -	TRACE_IRQS_IRETQ
> +	TRACE_IRQS_IRETQ		/* we're about to change IF */
>  
>  	/*
>  	 * Try to use SYSRET instead of IRET if we're returning to

Hallelujah!

/me luvz hunks which remove a bunch of asm :)

> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index efe0b1e499fa..ac0658142ae1 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -209,10 +209,10 @@ sysexit_from_sys_call:
>  	.endm
>  
>  	.macro auditsys_exit exit
> -	testl	$(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> -	jnz	ia32_ret_from_sys_call
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
> +	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> +	jnz ia32_ret_from_sys_call

I guess you want to use tabs here like the rest of the macro does.

>  	movl	%eax, %esi		/* second arg, syscall return value */
>  	cmpl	$-MAX_ERRNO, %eax	/* is it an error ? */
>  	jbe	1f
> @@ -227,11 +227,10 @@ sysexit_from_sys_call:
>  	testl	%edi, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>  	jz	\exit
>  	xorl	%eax, %eax		/* Do not leak kernel information */
> -	movq	%rax, R11(%rsp)

I guess that change needs at least some explanation in the commit
message. AFAIU, this is RIP we shouldn't be zeroing for we need it in
int_ret_from_sys_call...

>  	movq	%rax, R10(%rsp)
>  	movq	%rax, R9(%rsp)
>  	movq	%rax, R8(%rsp)
> -	jmp	int_with_check
> +	jmp	int_ret_from_sys_call_irqs_off
>  	.endm
>  
>  sysenter_auditsys:
> -- 
> 2.4.3
> 
> 

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry
  2015-06-29 19:33 ` [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry Andy Lutomirski
@ 2015-07-02 10:52   ` Borislav Petkov
  2015-07-02 15:33     ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 10:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:44PM -0700, Andy Lutomirski wrote:
> To prepare for the big rewrite of the error and interrupt exit
> paths, we will need pt_regs completely filled in.  It's already
> completely filled in when error_exit runs, so rearrange interrupt
> handling to match it.  This will slow down interrupt handling very
> slightly (eight instructions), but the simplification it enables
> will be more than worth it.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 9bc76766aa71..51d34fafe4c0 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -502,21 +502,13 @@ END(irq_entries_start)
>  /* 0(%rsp): ~(interrupt number) */
>  	.macro interrupt func
>  	cld
> -	/*
> -	 * Since nothing in interrupt handling code touches r12...r15 members
> -	 * of "struct pt_regs", and since interrupts can nest, we can save
> -	 * four stack slots and simultaneously provide
> -	 * an unwind-friendly stack layout by saving "truncated" pt_regs
> -	 * exactly up to rbp slot, without these members.
> -	 */
> -	ALLOC_PT_GPREGS_ON_STACK -RBP
> -	SAVE_C_REGS -RBP
> -	/* this goes to 0(%rsp) for unwinder, not for saving the value: */
> -	SAVE_EXTRA_REGS_RBP -RBP

That macro's unused now AFAICT.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-06-29 19:33 ` [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C Andy Lutomirski
@ 2015-07-02 12:09   ` Borislav Petkov
  2015-07-02 16:09     ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 12:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:46PM -0700, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S        | 63 +++++++++++-----------------------------
>  arch/x86/entry/entry_64_compat.S |  5 ++++
>  2 files changed, 22 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index ce2c9049abef..08a37ec049f0 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -508,7 +508,16 @@ END(irq_entries_start)
>  
>  	testb	$3, CS(%rsp)
>  	jz	1f
> +
> +	/*
> +	 * IRQ from user mode.  Switch to kernel gsbase and inform context
> +	 * tracking that we're in kernel mode.
> +	 */
>  	SWAPGS
> +#ifdef CONFIG_CONTEXT_TRACKING
> +	call enter_from_user_mode
> +#endif

I think you can make this much cleaner by getting rid of the ifdeffery
and pushing it into the enter_from_user_mode() function:

__visible void enter_from_user_mode(void)
{
#ifdef CONFIG_CONTEXT_TRACKING

	...

#endif
}

The disadvantage of all that cleanliness is that we get one dumb

	call enter_from_user_mode

there to this abomination:

ffffffff810014f0 <enter_from_user_mode>:
ffffffff810014f0:       e8 db 97 67 00          callq  ffffffff8167acd0 <__fentry__>
ffffffff810014f5:       55                      push   %rbp
ffffffff810014f6:       48 89 e5                mov    %rsp,%rbp
ffffffff810014f9:       5d                      pop    %rbp
ffffffff810014fa:       c3                      retq   
ffffffff810014fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

which sux.

We sure could use LTO here.
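
Spelled out with the body which patch 08 adds (assuming user_exit()
stays the underlying hook), the suggestion above would read:

__visible void enter_from_user_mode(void)
{
#ifdef CONFIG_CONTEXT_TRACKING
	/* we must still be in CONTEXT_USER when we get here */
	CT_WARN_ON(ct_state() != CONTEXT_USER);
	user_exit();
#endif
}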

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path
  2015-07-02 10:25   ` Borislav Petkov
@ 2015-07-02 15:33     ` Andy Lutomirski
  2015-07-02 16:29       ` Borislav Petkov
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-02 15:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 3:25 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:42PM -0700, Andy Lutomirski wrote:
>> In 539f51136500 ("x86/asm/entry/64: Disentangle error_entry/exit
>> gsbase/ebx/usermode code"), I arranged the code slightly wrong --
>> IRET faults would skip the code path that was intended to execute on
>> all error entries from user mode.  Fix it up.
>>
>> This does not fix a bug, but we'll need it, and it slightly shrinks
>> the code.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/entry_64.S | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 141a5d49dddc..cd9cbc62159c 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1145,9 +1145,14 @@ ENTRY(error_entry)
>>       testb   $3, CS+8(%rsp)
>>       jz      error_kernelspace
>>
>> -     /* We entered from user mode */
>> +error_entry_from_usermode_swapgs:
>> +     /*
>> +      * We entered from user mode or we're pretending to have entered
>> +      * from user mode due to an IRET fault.
>> +      */
>>       SWAPGS
>>
>> +error_entry_from_usermode_after_swapgs:
>>  error_entry_done:
>
> Why the second label?

"Migrate error and interrupt exit work to C" puts code between those
labels.  I could squash the patches together to make it less
mysterious.

>
> Also, please make all those labels local by prefixing them with .L
> because they appear unnecessarily in objdump output as global symbols.

Will do.

>
> $ objdump -d vmlinux | grep -E "^[0-9a-f]+ <(error_entry_\w*|error_kernelspace|bstep_iret|error_bad_iret)"
> ffffffff8167a4a8 <error_entry_from_usermode_swapgs>:
> ffffffff8167a4ab <error_entry_done>:
> ffffffff8167a4b1 <error_kernelspace>:
> ffffffff8167a4e0 <bstep_iret>:
> ffffffff8167a4e8 <error_bad_iret>:
>
> Oh, and shorter please :)
>

I'll try to think of a good shorter name.  No promises :)

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry
  2015-07-02 10:52   ` Borislav Petkov
@ 2015-07-02 15:33     ` Andy Lutomirski
  0 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-02 15:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 3:52 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:44PM -0700, Andy Lutomirski wrote:
>> To prepare for the big rewrite of the error and interrupt exit
>> paths, we will need pt_regs completely filled in.  It's already
>> completely filled in when error_exit runs, so rearrange interrupt
>> handling to match it.  This will slow down interrupt handling very
>> slightly (eight instructions), but the simplification it enables
>> will be more than worth it.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/entry_64.S | 29 +++++++++--------------------
>>  1 file changed, 9 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 9bc76766aa71..51d34fafe4c0 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -502,21 +502,13 @@ END(irq_entries_start)
>>  /* 0(%rsp): ~(interrupt number) */
>>       .macro interrupt func
>>       cld
>> -     /*
>> -      * Since nothing in interrupt handling code touches r12...r15 members
>> -      * of "struct pt_regs", and since interrupts can nest, we can save
>> -      * four stack slots and simultaneously provide
>> -      * an unwind-friendly stack layout by saving "truncated" pt_regs
>> -      * exactly up to rbp slot, without these members.
>> -      */
>> -     ALLOC_PT_GPREGS_ON_STACK -RBP
>> -     SAVE_C_REGS -RBP
>> -     /* this goes to 0(%rsp) for unwinder, not for saving the value: */
>> -     SAVE_EXTRA_REGS_RBP -RBP
>
> That macro's unused now AFAICT.
>

I'll delete it in v5.

--Andy

> --
> Regards/Gruss,
>     Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
  2015-07-02 10:49   ` Borislav Petkov
@ 2015-07-02 15:56     ` Andy Lutomirski
  0 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-02 15:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 3:49 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:43PM -0700, Andy Lutomirski wrote:
>> These need to be migrated together, as the compat case used to jump
>> into the middle of the 64-bit exit code.
>>

>> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
>> index efe0b1e499fa..ac0658142ae1 100644
>> --- a/arch/x86/entry/entry_64_compat.S
>> +++ b/arch/x86/entry/entry_64_compat.S
>> @@ -209,10 +209,10 @@ sysexit_from_sys_call:
>>       .endm
>>
>>       .macro auditsys_exit exit
>> -     testl   $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>> -     jnz     ia32_ret_from_sys_call
>>       TRACE_IRQS_ON
>>       ENABLE_INTERRUPTS(CLBR_NONE)
>> +     testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>> +     jnz ia32_ret_from_sys_call
>
> I guess you want to use tabs here like the rest of the macro does.
>

Oops.  I wrote this before everything got tabified and I guess I
didn't fix it up right.

>>       movl    %eax, %esi              /* second arg, syscall return value */
>>       cmpl    $-MAX_ERRNO, %eax       /* is it an error ? */
>>       jbe     1f
>> @@ -227,11 +227,10 @@ sysexit_from_sys_call:
>>       testl   %edi, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>>       jz      \exit
>>       xorl    %eax, %eax              /* Do not leak kernel information */
>> -     movq    %rax, R11(%rsp)
>
> I guess that change needs at least some explanation in the commit
> message. AFAIU, this is RIP we shouldn't be zeroing for we need it in
> int_ret_from_sys_call...

This change is a mistake.  There was another rebase issue in here that
I fixed, but apparently I still haven't gotten it right.  Sigh.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks
  2015-07-02  9:48   ` Borislav Petkov
@ 2015-07-02 16:03     ` Andy Lutomirski
  2015-07-02 16:25       ` Borislav Petkov
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-02 16:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 2:48 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:41PM -0700, Andy Lutomirski wrote:
>> The current entry and exit code is incomprehensible, appears to work
>> primarily by luck, and is very difficult to incrementally improve.  Add
>> new code in preparation for simply deleting the old code.
>>
>> prepare_exit_to_usermode is a new function that will handle all slow
>> path exits to user mode.  It is called with IRQs disabled and it
>> leaves us in a state in which it is safe to immediately return to
>> user mode.  IRQs must not be re-enabled at any point after
>> prepare_exit_to_usermode returns and user mode is actually entered.
>> (We can, of course, fail to enter user mode and treat that failure
>> as a fresh entry to kernel mode.)  All callers of do_notify_resume
>> will be migrated to call prepare_exit_to_usermode instead;
>> prepare_exit_to_usermode needs to do everything that
>> do_notify_resume does, but it also takes care of scheduling and
>> context tracking.  Unlike do_notify_resume, it does not need to be
>> called in a loop.
>>
>> syscall_return_slowpath is exactly what it sounds like.  It will be
>> called on any syscall exit slow path.  It will replace
>> syscall_trace_leave and it calls prepare_exit_to_usermode on the way
>> out.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/common.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 111 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index 8a7e35af7164..55530d6dd1bd 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -207,6 +207,7 @@ long syscall_trace_enter(struct pt_regs *regs)
>>               return syscall_trace_enter_phase2(regs, arch, phase1_result);
>>  }
>>
>> +/* Deprecated. */
>>  void syscall_trace_leave(struct pt_regs *regs)
>
> Ah yes, this will get replaced later with syscall_return_slowpath below.

I already have code in my tree to delete this, but it needs careful
testing for the 32-bit case.  The asm changes ended up being more
intrusive than the 64-bit change, and I had to fiddle with vm86 as
well.

>
> Stupid question: what assures us that we'll break out of this loop
> at some point? I.e., isn't a scenario possible where something keeps
> setting bits in ->flags while we're handling stuff in the IRQs-on
> section?

Nothing, actually.  We could spin forever if we keep scheduling in and
having TIF_NEED_RESCHED get set before we check again, or we could
even, in principle, keep delivering more and more signals forever.
This seems quite unlikely, though, and if we really end up delivering
infinite signals, something is very wrong already.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-02 12:09   ` Borislav Petkov
@ 2015-07-02 16:09     ` Andy Lutomirski
  2015-07-02 16:33       ` Borislav Petkov
                         ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-02 16:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 5:09 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Jun 29, 2015 at 12:33:46PM -0700, Andy Lutomirski wrote:
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/entry_64.S        | 63 +++++++++++-----------------------------
>>  arch/x86/entry/entry_64_compat.S |  5 ++++
>>  2 files changed, 22 insertions(+), 46 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index ce2c9049abef..08a37ec049f0 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -508,7 +508,16 @@ END(irq_entries_start)
>>
>>       testb   $3, CS(%rsp)
>>       jz      1f
>> +
>> +     /*
>> +      * IRQ from user mode.  Switch to kernel gsbase and inform context
>> +      * tracking that we're in kernel mode.
>> +      */
>>       SWAPGS
>> +#ifdef CONFIG_CONTEXT_TRACKING
>> +     call enter_from_user_mode
>> +#endif
>
> I think you can make this much cleaner by getting rid of the ifdeffery
> and pushing it into the enter_from_user_mode() function:
>
> __visible void enter_from_user_mode(void)
> {
> #ifdef CONFIG_CONTEXT_TRACKING
>
>         ...
>
> #endif
> }
>
> The disadvantage of all that cleanliness is that we get one dumb
>
>         call enter_from_user_mode
>
> there to this abomination:
>
> ffffffff810014f0 <enter_from_user_mode>:
> ffffffff810014f0:       e8 db 97 67 00          callq  ffffffff8167acd0 <__fentry__>
> ffffffff810014f5:       55                      push   %rbp
> ffffffff810014f6:       48 89 e5                mov    %rsp,%rbp
> ffffffff810014f9:       5d                      pop    %rbp
> ffffffff810014fa:       c3                      retq
> ffffffff810014fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>
> which sux.
>
> We sure could use LTO here.
>

You mean link-time asm optimizations?  Turning off frame pointers in
leaf functions as long as rbp is still preserved might not be so
terrible either.

I'm torn on this one.  In principle, you're right, or we could have a
macro CALL_ENTER_FROM_USER_MODE that does nothing if context tracking
is off.  OTOH, that's also kind of messy.

If we move even more of this stuff into C, then this problem goes
away.  "call enter_from_user_mode" turns into "enter_from_user_mode()"
or similar.
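
I.e., something like this at the top of a future C irq path (a sketch;
do_irq_from_user() is a made-up name):

__visible void do_irq_from_user(struct pt_regs *regs)
{
	enter_from_user_mode();	/* inlined away if the body is empty */

	/* ... the actual IRQ handling ... */
}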

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks
  2015-07-02 16:03     ` Andy Lutomirski
@ 2015-07-02 16:25       ` Borislav Petkov
  0 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 16:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 02, 2015 at 09:03:11AM -0700, Andy Lutomirski wrote:
> Nothing, actually.  We could spin forever if we keep scheduling in and
> having TIF_NEED_RESCHED get set before we check again, or we could
> even, in principle, keep delivering more and more signals forever.
> This seems quite unlikely, though, and if we really end up delivering
> infinite signals, something is very wrong already.

And if it hasn't happened already with the current code, then I guess
we're always catching a break. And we're preemptible for the whole
looping duration anyway, so it's not a problem even for -RT.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path
  2015-07-02 15:33     ` Andy Lutomirski
@ 2015-07-02 16:29       ` Borislav Petkov
  0 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 16:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 02, 2015 at 08:33:05AM -0700, Andy Lutomirski wrote:
> "Migrate error and interrupt exit work to C" puts code between those
> labels.  I could squash the patches together to make it less
> mysterious.

Nah, it is my mistake of not reviewing the whole thing first and then
sending replies. Sorry about that.

> > Also, please make all those labels local by prefixing them with .L
> > because they appear unnecessarily in objdump output as global symbols.
> 
> Will do.

Yeah, the only partial advantage of having the global labels would be
easier debuggability, but one can get to the proper offset even without
those labels, so...

> I'll try to think of a good shorter name.  No promises :)

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-02 16:09     ` Andy Lutomirski
@ 2015-07-02 16:33       ` Borislav Petkov
  2015-07-03  6:33       ` Ingo Molnar
  2015-07-03 14:37       ` Denys Vlasenko
  2 siblings, 0 replies; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 16:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 02, 2015 at 09:09:23AM -0700, Andy Lutomirski wrote:
> You mean link-time asm optimizations?  Turning off frame pointers in
> leaf functions as long as rbp is still preserved might not be so
> terrible either.

Is there a gcc switch for that?

> I'm torn on this one.  In principle, you're right, or we could have a
> macro CALL_ENTER_FROM_USER_MODE that does nothing if context tracking
> is off.  OTOH, that's also kind of messy.

Yeah, I'm torn too.

If you have something better, cool. Otherwise it is a judgement call for
tip people.

> If we move even more of this stuff into C, then this problem goes
> away.  "call enter_from_user_mode" turns into "enter_from_user_mode()"
> or similar.

... and that should be optimized away, I'd guess, if the body's empty.
Something like empty body removal or whatever...
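
Right, an empty static inline gets folded away at -O2. Trivial demo
(made-up names, obviously):

static inline void hook(void) { }	/* body empty when the config is off */

void entry_path(void)
{
	hook();		/* inlined away; no call remains in entry_path() */
}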

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/17] x86: Rewrite exit-to-userspace code
  2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
                   ` (17 preceding siblings ...)
  2015-06-29 19:46 ` [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Richard Weinberger
@ 2015-07-02 16:45 ` Borislav Petkov
  2015-07-03  6:34   ` Ingo Molnar
  18 siblings, 1 reply; 54+ messages in thread
From: Borislav Petkov @ 2015-07-02 16:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Denys Vlasenko, Kees Cook, Brian Gerst, paulmck

On Mon, Jun 29, 2015 at 12:33:32PM -0700, Andy Lutomirski wrote:
> This is the first big batch of x86 asm-to-C conversion patches.

So yeah, I think I'm done. It looks good to me, except the minor things
already raised. I've tested the tree already on my boxes here and it was
ok.

Maybe it is time now for v5 to see a wider audience in tip :-)

Thanks Andy.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks
  2015-06-29 19:33 ` [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks Andy Lutomirski
  2015-07-02 10:49   ` Borislav Petkov
@ 2015-07-02 16:56   ` Denys Vlasenko
  1 sibling, 0 replies; 54+ messages in thread
From: Denys Vlasenko @ 2015-07-02 16:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Linux Kernel Mailing List, Frédéric Weisbecker,
	Rik van Riel, Oleg Nesterov, Borislav Petkov, Kees Cook,
	Brian Gerst, Paul McKenney

On Mon, Jun 29, 2015 at 9:33 PM, Andy Lutomirski <luto@kernel.org> wrote:
> +       jmp int_ret_from_sys_call


> +       testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> +       jnz ia32_ret_from_sys_call

The files got reformatted to use a consistent style,
with a tab between the insn mnemonic and its arguments.
Please do not break it again...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-02 16:09     ` Andy Lutomirski
  2015-07-02 16:33       ` Borislav Petkov
@ 2015-07-03  6:33       ` Ingo Molnar
  2015-07-03 16:27         ` Andy Lutomirski
  2015-07-03 14:37       ` Denys Vlasenko
  2 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-07-03  6:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney


* Andy Lutomirski <luto@amacapital.net> wrote:

> If we move even more of this stuff into C, then this problem goes away.  "call 
> enter_from_user_mode" turns into "enter_from_user_mode()" or similar.

Yes. I think we should first see how that process works out, and then see what 
else can be done. For the initial step I'm willing to trade up to 10 cycles in 
exchange for sane and maintainable x86 entry code that we can then speed up ...

I presume we are still within that budget?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/17] x86: Rewrite exit-to-userspace code
  2015-07-02 16:45 ` Borislav Petkov
@ 2015-07-03  6:34   ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-07-03  6:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, paulmck


* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Jun 29, 2015 at 12:33:32PM -0700, Andy Lutomirski wrote:
> > This is the first big batch of x86 asm-to-C conversion patches.
> 
> So yeah, I think I'm done. It looks good to me, except the minor things already 
> raised. I've tested the tree already on my boxes here and it was ok.
> 
> Maybe it is time now for v5 to see a wider audience in tip :-)

Ok, I'll wait for v5 and add it to -tip. Thanks for the review!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-02 16:09     ` Andy Lutomirski
  2015-07-02 16:33       ` Borislav Petkov
  2015-07-03  6:33       ` Ingo Molnar
@ 2015-07-03 14:37       ` Denys Vlasenko
  2015-07-03 16:24         ` Andy Lutomirski
  2 siblings, 1 reply; 54+ messages in thread
From: Denys Vlasenko @ 2015-07-03 14:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 6:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 2, 2015 at 5:09 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Mon, Jun 29, 2015 at 12:33:46PM -0700, Andy Lutomirski wrote:
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>>  arch/x86/entry/entry_64.S        | 63 +++++++++++-----------------------------
>>>  arch/x86/entry/entry_64_compat.S |  5 ++++
>>>  2 files changed, 22 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>>> index ce2c9049abef..08a37ec049f0 100644
>>> --- a/arch/x86/entry/entry_64.S
>>> +++ b/arch/x86/entry/entry_64.S
>>> @@ -508,7 +508,16 @@ END(irq_entries_start)
>>>
>>>       testb   $3, CS(%rsp)
>>>       jz      1f
>>> +
>>> +     /*
>>> +      * IRQ from user mode.  Switch to kernel gsbase and inform context
>>> +      * tracking that we're in kernel mode.
>>> +      */
>>>       SWAPGS
>>> +#ifdef CONFIG_CONTEXT_TRACKING
>>> +     call enter_from_user_mode
>>> +#endif
>>
>> I think you can make this much cleaner by getting rid of the ifdeffery
>> and pushing it into the enter_from_user_mode() function:
>>
>> __visible void enter_from_user_mode(void)
>> {
>> #ifdef CONFIG_CONTEXT_TRACKING
>>
>>         ...
>>
>> #endif
>> }
>>
>> The disadvantage of all that cleanliness is that we get one dumb
>>
>>         call enter_from_user_mode
>>
>> there to this abomination:
>>
>> ffffffff810014f0 <enter_from_user_mode>:
>> ffffffff810014f0:       e8 db 97 67 00          callq  ffffffff8167acd0 <__fentry__>
>> ffffffff810014f5:       55                      push   %rbp
>> ffffffff810014f6:       48 89 e5                mov    %rsp,%rbp
>> ffffffff810014f9:       5d                      pop    %rbp
>> ffffffff810014fa:       c3                      retq
>> ffffffff810014fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>
>> which sux.
>>
>> We sure could use LTO here.
>>
>
> You mean link-time asm optimizations?  Turning off frame pointers in
> leaf functions as long as rbp is still preserved might not be so
> terrible either.
>
> I'm torn on this one.  In principle, you're right, or we could have a
> macro CALL_ENTER_FROM_USER_MODE that does nothing if context tracking
> is off.  OTOH, that's also kind of messy.

busybox has IF() macros:

IF_CONTEXT_TRACKING(call enter_from_user_mode)

Unlike #if, this needs one line, not three.
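
In busybox those come out of the config system; a hand-rolled kernel
equivalent would be (a sketch):

#ifdef CONFIG_CONTEXT_TRACKING
# define IF_CONTEXT_TRACKING(...) __VA_ARGS__
#else
# define IF_CONTEXT_TRACKING(...)
#endif

This works in .S files too, since those go through the preprocessor.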

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-03 14:37       ` Denys Vlasenko
@ 2015-07-03 16:24         ` Andy Lutomirski
  2015-07-04  8:12           ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-03 16:24 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Kees Cook, Brian Gerst, Paul McKenney

On Fri, Jul 3, 2015 at 7:37 AM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> On Thu, Jul 2, 2015 at 6:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jul 2, 2015 at 5:09 AM, Borislav Petkov <bp@alien8.de> wrote:
>>> On Mon, Jun 29, 2015 at 12:33:46PM -0700, Andy Lutomirski wrote:
>>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>>> ---
>>>>  arch/x86/entry/entry_64.S        | 63 +++++++++++-----------------------------
>>>>  arch/x86/entry/entry_64_compat.S |  5 ++++
>>>>  2 files changed, 22 insertions(+), 46 deletions(-)
>>>>
>>>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>>>> index ce2c9049abef..08a37ec049f0 100644
>>>> --- a/arch/x86/entry/entry_64.S
>>>> +++ b/arch/x86/entry/entry_64.S
>>>> @@ -508,7 +508,16 @@ END(irq_entries_start)
>>>>
>>>>       testb   $3, CS(%rsp)
>>>>       jz      1f
>>>> +
>>>> +     /*
>>>> +      * IRQ from user mode.  Switch to kernel gsbase and inform context
>>>> +      * tracking that we're in kernel mode.
>>>> +      */
>>>>       SWAPGS
>>>> +#ifdef CONFIG_CONTEXT_TRACKING
>>>> +     call enter_from_user_mode
>>>> +#endif
>>>
>>> I think you can make this much cleaner by getting rid of the ifdeffery
>>> and pushing it into the enter_from_user_mode() function:
>>>
>>> __visible void enter_from_user_mode(void)
>>> {
>>> #ifdef CONFIG_CONTEXT_TRACKING
>>>
>>>         ...
>>>
>>> #endif
>>> }
>>>
>>> The disadvantage of all that cleanliness is that we get one dumb
>>>
>>>         call enter_from_user_mode
>>>
>>> there to this abomination:
>>>
>>> ffffffff810014f0 <enter_from_user_mode>:
>>> ffffffff810014f0:       e8 db 97 67 00          callq  ffffffff8167acd0 <__fentry__>
>>> ffffffff810014f5:       55                      push   %rbp
>>> ffffffff810014f6:       48 89 e5                mov    %rsp,%rbp
>>> ffffffff810014f9:       5d                      pop    %rbp
>>> ffffffff810014fa:       c3                      retq
>>> ffffffff810014fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>>
>>> which sux.
>>>
>>> We sure could use LTO here.
>>>
>>
>> You mean link-time asm optimizations?  Turning off frame pointers in
>> leaf functions as long as rbp is still preserved might not be so
>> terrible either.
>>
>> I'm torn on this one.  In principle, you're right, or we could have a
>> macro CALL_ENTER_FROM_USER_MODE that does nothing if context tracking
>> is off.  OTOH, that's also kind of messy.
>
> busybox has IF() macros:
>
> IF_CONTEXT_TRACKING(call enter_from_user_mode)
>
> Unlike #if, this needs one line, not three.

I predict all of these call sites will get moved to C before 4.3.  If
not, then I'd be glad to clean this part up.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-03  6:33       ` Ingo Molnar
@ 2015-07-03 16:27         ` Andy Lutomirski
  2015-07-03 16:29           ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-03 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Thu, Jul 2, 2015 at 11:33 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> If we move even more of this stuff into C, then this problem goes away.  "call
>> enter_from_user_mode" turns into "enter_from_user_mode()" or similar.
>
> Yes. I think we should first see how that process works out, and then see what
> else can be done. For the initial step I'm willing to trade up to 10 cycles in
> exchange for sane and maintainable x86 entry code that we can then speed up ...
>
> I presume we are still within that budget?
>

I haven't benchmarked very carefully.  IIRC syscall timing was
unaffected by this series.  Exceptions ought to be unaffected unless
something very weird is going on with the trace cache or branch
predictor.  IRQs will slow down by a couple cycles (I'm guessing 5,
since it's more or less the same change I benchmarked when I was
playing with the next pile of patches).

My current plan is to try to send the 32-bit asm changes some time
next week (after I find a vm86-using DOS game to test) and then the
actual performance-affecting patch after that.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-03 16:27         ` Andy Lutomirski
@ 2015-07-03 16:29           ` Andy Lutomirski
  0 siblings, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2015-07-03 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	Frédéric Weisbecker, Rik van Riel, Oleg Nesterov,
	Denys Vlasenko, Kees Cook, Brian Gerst, Paul McKenney

On Fri, Jul 3, 2015 at 9:27 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I haven't benchmarked very carefully.  IIRC syscall timing was
> unaffected by this series.  Exceptions ought to be unaffected unless
> something very weird is going on with the trace cache or branch
> predictor.  IRQs will slow down by a couple cycles (I'm guessing 5,
> since it's more or less the same change I benchmarked when I was
> playing with the next pile of patches).

One clarification: I bet that context tracking performance changes
with these patches, but it sucked before and it'll suck afterwards
regardless.  I think that these patches are a good step in making it
not suck, though -- they will allow us to only change ct state once on
entry and once on exit, and we used to get it wrong.  They also allow
us to do context tracking with IRQs off, which Rik van Riel wanted to
help his speedup project.

So, in the long run, I expect a huge context tracking improvement, and
these patches will help.  By themselves, they won't do much.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C
  2015-07-03 16:24         ` Andy Lutomirski
@ 2015-07-04  8:12           ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-07-04  8:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Borislav Petkov, Andy Lutomirski, X86 ML,
	linux-kernel, Frédéric Weisbecker, Rik van Riel,
	Oleg Nesterov, Kees Cook, Brian Gerst, Paul McKenney


* Andy Lutomirski <luto@amacapital.net> wrote:

> >> I'm torn on this one.  In principle, you're right, or we could have a macro 
> >> CALL_ENTER_FROM_USER_MODE that does nothing if context tracking is off.  
> >> OTOH, that's also kind of messy.
> >
> > busybox has IF() macros:
> >
> > IF_CONTEXT_TRACKING(call enter_from_user_mode)
> >
> > Unlike #if, this needs one line, not three.
> 
> I predict all of these call sites will get moved to C before 4.3.  If not, then 
> I'd be glad to clean this part up.

Ok, let's see how it works out.

I think in the initial phase we should be conservative and should attempt to 
introduce as few unrelated changes as possible, and get the conversion done. The 
current x86 entry code is a reasonable base to start with.

Once most of the 'mechanical' conversion is done we can do all the other changes.

The conversion itself is risky enough as-is; we want to defer as much
other risk as possible to after the conversion is done.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2015-07-05  8:43 UTC | newest]

Thread overview: 54+ messages
2015-06-29 19:33 [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 01/17] selftests/x86: Add a test for 32-bit fast syscall arg faults Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 02/17] x86/entry/64/compat: Fix bad fast syscall arg failure path Andy Lutomirski
2015-06-30 10:58   ` Borislav Petkov
2015-06-30 11:06     ` Ingo Molnar
2015-06-30 16:04     ` Andy Lutomirski
2015-07-01  7:43       ` Ingo Molnar
2015-06-29 19:33 ` [PATCH v4 03/17] uml: Fix do_signal() prototype Andy Lutomirski
2015-06-29 20:47   ` Richard Weinberger
2015-06-29 19:33 ` [PATCH v4 04/17] context_tracking: Add ct_state and CT_WARN_ON Andy Lutomirski
2015-06-30 12:20   ` Borislav Petkov
2015-06-30 12:53     ` Ingo Molnar
2015-06-29 19:33 ` [PATCH v4 05/17] notifiers: Assert that RCU is watching in notify_die Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 06/17] x86: Move C entry and exit code to arch/x86/entry/common.c Andy Lutomirski
2015-06-30 16:32   ` Borislav Petkov
2015-06-29 19:33 ` [PATCH v4 07/17] x86/traps: Assert that we're in CONTEXT_KERNEL in exception entries Andy Lutomirski
2015-06-30 17:01   ` Borislav Petkov
2015-06-30 17:08     ` Andy Lutomirski
2015-06-30 17:15       ` Borislav Petkov
2015-06-29 19:33 ` [PATCH v4 08/17] x86/entry: Add enter_from_user_mode and use it in syscalls Andy Lutomirski
2015-07-01 10:24   ` Borislav Petkov
2015-06-29 19:33 ` [PATCH v4 09/17] x86/entry: Add new, comprehensible entry and exit hooks Andy Lutomirski
2015-07-02  9:48   ` Borislav Petkov
2015-07-02 16:03     ` Andy Lutomirski
2015-07-02 16:25       ` Borislav Petkov
2015-06-29 19:33 ` [PATCH v4 10/17] x86/entry/64: Really create an error-entry-from-usermode code path Andy Lutomirski
2015-07-02 10:25   ` Borislav Petkov
2015-07-02 15:33     ` Andy Lutomirski
2015-07-02 16:29       ` Borislav Petkov
2015-06-29 19:33 ` [PATCH v4 11/17] x86/entry/64: Migrate 64-bit and compat syscalls to new exit hooks Andy Lutomirski
2015-07-02 10:49   ` Borislav Petkov
2015-07-02 15:56     ` Andy Lutomirski
2015-07-02 16:56   ` Denys Vlasenko
2015-06-29 19:33 ` [PATCH v4 12/17] x86/asm/entry/64: Save all regs on interrupt entry Andy Lutomirski
2015-07-02 10:52   ` Borislav Petkov
2015-07-02 15:33     ` Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 13/17] x86/asm/entry/64: Simplify irq stack pt_regs handling Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 14/17] x86/asm/entry/64: Migrate error and interrupt exit work to C Andy Lutomirski
2015-07-02 12:09   ` Borislav Petkov
2015-07-02 16:09     ` Andy Lutomirski
2015-07-02 16:33       ` Borislav Petkov
2015-07-03  6:33       ` Ingo Molnar
2015-07-03 16:27         ` Andy Lutomirski
2015-07-03 16:29           ` Andy Lutomirski
2015-07-03 14:37       ` Denys Vlasenko
2015-07-03 16:24         ` Andy Lutomirski
2015-07-04  8:12           ` Ingo Molnar
2015-06-29 19:33 ` [PATCH v4 15/17] x86/entry: Remove exception_enter from most trap handlers Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 16/17] x86/entry: Remove SCHEDULE_USER and asm/context-tracking.h Andy Lutomirski
2015-06-29 19:33 ` [PATCH v4 17/17] x86/irq: Document how IRQ context tracking works and add an assertion Andy Lutomirski
2015-06-29 19:46 ` [PATCH v4 00/17] x86: Rewrite exit-to-userspace code Richard Weinberger
2015-06-29 20:14   ` Andy Lutomirski
2015-07-02 16:45 ` Borislav Petkov
2015-07-03  6:34   ` Ingo Molnar
