linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] ptrace-vs-syscall-restart fixes, v3
@ 2016-06-20 23:39 Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:39 UTC (permalink / raw)
  To: x86, linux-kernel; +Cc: Borislav Petkov, Andy Lutomirski

Now that I have a complete fix and cleanup that I like, here it is.
Patch 1 is the same as before except for a config-dependent build
fix.

This series is a cleanup and fixes at least two bugs:

 - 64-bit gdb attached to a 32-bit program malfunctions if the user
   types something like "print foo()" while stopped on exit from an
   interrupted ERESTART_RESTARTBLOCK-using syscall.

 - A 32-bit tracer that writes orig_eax while the tracee is stopped
   at syscall entry will break in_ia32_syscall() and
   syscall_get_arch().

After this is applied, it might be safe to drop all of the TS_COMPAT
checks in syscall.h, but I'd want to think carefully about that.

The reason that patch 1 is here at all instead of being folded into
the other patches is that it's intended to be cleanly and safely
backported if needed.

Patch 1 is for 4.7 or 4.8 at the maintainers' discretion.  It could
also make sense for -stable -- I thing it fixes a bug that could be
exploited to confuse the syscall auit logs.  Patch 2 and 3 are
intended for 4.8.

Pedro, can you try to test this series a bit?  I'm having trouble
getting ptrace-tests to pass even on an unmodified kernel.

Andy Lutomirski (3):
  x86/ptrace: Stop setting TS_COMPAT in ptrace code
  x86/signal: Rewire the restart_block() syscall to have a constant nr
  x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED

 arch/x86/entry/syscalls/syscall_32.tbl |  2 ++
 arch/x86/entry/syscalls/syscall_64.tbl |  3 +++
 arch/x86/include/asm/syscall.h         |  3 ---
 arch/x86/kernel/ptrace.c               | 34 +++++++++++++++++++++-------------
 arch/x86/kernel/signal.c               | 16 ++++++++++------
 5 files changed, 36 insertions(+), 22 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-06-20 23:39 [PATCH v3 0/3] ptrace-vs-syscall-restart fixes, v3 Andy Lutomirski
@ 2016-06-20 23:39 ` Andy Lutomirski
  2016-06-22 22:13   ` Oleg Nesterov
  2016-07-24 18:47   ` Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED Andy Lutomirski
  2 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:39 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Andy Lutomirski, Pedro Alves, Oleg Nesterov, Kees Cook

Setting TS_COMPAT in ptrace is wrong: if we happen to do it during
syscall entry, then we'll confuse seccomp and audit.  (The former
isn't a security problem: seccomp is currently entirely insecure if a
malicious ptracer is attached.)  As a minimal fix, this patch adds a
new flag TS_I386_REGS_POKED that handles the ptrace special case.

Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c            |  6 +++++-
 arch/x86/include/asm/syscall.h     |  5 +----
 arch/x86/include/asm/thread_info.h |  3 +++
 arch/x86/kernel/ptrace.c           | 15 +++++++++------
 arch/x86/kernel/signal.c           | 26 ++++++++++++++++++++++++--
 5 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e538c44..0db497a8ff19 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -270,8 +270,12 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * handling, because syscall restart has a fixup for compat
 	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
 	 * selftest.
+	 *
+	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
+	 * special case only applies after poking regs and before the
+	 * very next return to user mode.
 	 */
-	ti->status &= ~TS_COMPAT;
+	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
 	user_enter();
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 999b7cd2e78c..4e23dd15c661 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
 	 * TS_COMPAT is set for 32-bit syscall entries and then
 	 * remains set until we return to user mode.
 	 */
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task_thread_info(task)->status & (TS_COMPAT|TS_I386_REGS_POKED))
 		/*
 		 * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
 		 * and will match correctly in comparisons.
@@ -239,9 +239,6 @@ static inline int syscall_get_arch(void)
 	 * TS_COMPAT is set for 32-bit syscall entry and then
 	 * remains set until we return to user mode.
 	 *
-	 * TIF_IA32 tasks should always have TS_COMPAT set at
-	 * system call time.
-	 *
 	 * x32 tasks should be considered AUDIT_ARCH_X86_64.
 	 */
 	if (task_thread_info(current)->status & TS_COMPAT)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133ac05cd..4bca518d11f4 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -228,6 +228,9 @@ static inline unsigned long current_stack_pointer(void)
  * have to worry about atomic accesses.
  */
 #define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
+#ifdef CONFIG_COMPAT
+#define TS_I386_REGS_POKED	0x0004	/* regs poked by 32-bit ptracer */
+#endif
 #define TS_RESTORE_SIGMASK	0x0008	/* restore signal mask in do_signal() */
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 600edd225e81..f79576a541ff 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -923,15 +923,18 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 
 	case offsetof(struct user32, regs.orig_eax):
 		/*
-		 * A 32-bit debugger setting orig_eax means to restore
-		 * the state of the task restarting a 32-bit syscall.
-		 * Make sure we interpret the -ERESTART* codes correctly
-		 * in case the task is not actually still sitting at the
-		 * exit from a 32-bit syscall with TS_COMPAT still set.
+		 * Warning: bizarre corner case fixup here.  A 32-bit
+		 * debugger setting orig_eax to -1 wants to disable
+		 * syscall restart.  Make sure that the syscall
+		 * restart code sign-extends orig_ax.  Also make sure
+		 * we interpret the -ERESTART* codes correctly if
+		 * loaded into regs->ax in case the task is not
+		 * actually still sitting at the exit from a 32-bit
+		 * syscall with TS_COMPAT still set.
 		 */
 		regs->orig_ax = value;
 		if (syscall_get_nr(child, regs) >= 0)
-			task_thread_info(child)->status |= TS_COMPAT;
+			task_thread_info(child)->status |= TS_I386_REGS_POKED;
 		break;
 
 	case offsetof(struct user32, regs.eflags):
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 22cc2f9f8aec..6b952e1d8db8 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -760,8 +760,30 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
-#ifdef CONFIG_X86_64
-	if (in_ia32_syscall())
+	/*
+	 * This function is fundamentally broken as currently
+	 * implemented.
+	 *
+	 * The idea is that we want to trigger a call to the
+	 * restart_block() syscall and that we want in_ia32_syscall(),
+	 * in_x32_syscall(), etc.  to match whatever they were in the
+	 * syscall being restarted.  We assume that the syscall
+	 * instruction at (regs->ip - 2) matches whatever syscall
+	 * instruction we used to enter in the first place.
+	 *
+	 * The problem is that we can get here when ptrace pokes
+	 * syscall-like values into regs even if we're not in a syscall
+	 * at all.
+	 *
+	 * For now, we maintain historical behavior and guess based on
+	 * stored state.  We could do better by saving the actual
+	 * syscall arch in restart_block or (with caveats on x32) by
+	 * checking if regs->ip points to 'int $0x80'.  The current
+	 * behavior is incorrect if a tracer has a different bitness
+	 * than the tracee.
+	 */
+#ifdef CONFIG_IA32_EMULATION
+	if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
 		return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-20 23:39 [PATCH v3 0/3] ptrace-vs-syscall-restart fixes, v3 Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
@ 2016-06-20 23:39 ` Andy Lutomirski
  2016-06-21 12:39   ` Pedro Alves
  2016-06-23 21:21   ` Oleg Nesterov
  2016-06-20 23:39 ` [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED Andy Lutomirski
  2 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:39 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Andy Lutomirski, Pedro Alves, Oleg Nesterov, Kees Cook

Suppose a 64-bit task A traces a 32-bit task B.  B makes a syscall
that uses ERESTART_RESTARTBLOCK and gets a signal.  A catches
syscall exit, snapshots B's regs, changes the regs, and resumes.
Then A restores the snapshot of B's regs.

A expects B to resume where it left off when snapshotted, which means
that B should execute sys_restart_block().  We have a big set of hacks
in place that make this work in some cases, but in this particular
case, the hacks fall apart.  When A restores regs, because A is
64-bit, ptrace() will *not* set TS_I386_REGS_POKED.  But, because B
isn't stopped in a syscall, TS_COMPAT won't be set either.  As a
result, the current code will set nr=219 (the 64-bit
__NR_restart_syscall) and then promptly invoke madvise() (the 32-bit
syscall with nr==219).

This patch fixes the bug and simplifies the code simultaneous by
simply wiring up nr==380 to refer to sys_restart_block() in all cases.

Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  2 ++
 arch/x86/entry/syscalls/syscall_64.tbl |  3 +++
 arch/x86/kernel/signal.c               | 34 ++++++++--------------------------
 3 files changed, 13 insertions(+), 26 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cddd17153fb..9fc9272d0c41 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,5 @@
 377	i386	copy_file_range		sys_copy_file_range
 378	i386	preadv2			sys_preadv2			compat_sys_preadv2
 379	i386	pwritev2		sys_pwritev2			compat_sys_pwritev2
+# Same as restart_syscall but with the same nr for all ABIs.
+380	i386	restart_syscall2	sys_restart_syscall
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 555263e385c9..f074952eaad8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -336,6 +336,9 @@
 327	64	preadv2			sys_preadv2
 328	64	pwritev2		sys_pwritev2
 
+# Same as restart_syscall but with the same nr for all ABIs.
+380	common	restart_syscall2	sys_restart_syscall
+
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6b952e1d8db8..b9f5133867f3 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -761,35 +761,17 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 {
 	/*
-	 * This function is fundamentally broken as currently
-	 * implemented.
-	 *
-	 * The idea is that we want to trigger a call to the
-	 * restart_block() syscall and that we want in_ia32_syscall(),
-	 * in_x32_syscall(), etc.  to match whatever they were in the
-	 * syscall being restarted.  We assume that the syscall
-	 * instruction at (regs->ip - 2) matches whatever syscall
-	 * instruction we used to enter in the first place.
-	 *
-	 * The problem is that we can get here when ptrace pokes
-	 * syscall-like values into regs even if we're not in a syscall
-	 * at all.
-	 *
-	 * For now, we maintain historical behavior and guess based on
-	 * stored state.  We could do better by saving the actual
-	 * syscall arch in restart_block or (with caveats on x32) by
-	 * checking if regs->ip points to 'int $0x80'.  The current
-	 * behavior is incorrect if a tracer has a different bitness
-	 * than the tracee.
+	 * We can't reliably distinguish between restarting an i386
+	 * syscall and an x86_64 syscall without decoding the
+	 * instruction at regs->ip -= 2.  Rather than trying, restart
+	 * using __NR_restart_syscall2, which works regardless of ABI.
+	 * We still want to preserve the x32 state, but we can do that
+	 * directly.
 	 */
-#ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
-		return __NR_ia32_restart_syscall;
-#endif
 #ifdef CONFIG_X86_X32_ABI
-	return __NR_restart_syscall | (regs->orig_ax & __X32_SYSCALL_BIT);
+	return __NR_restart_syscall2 | (regs->orig_ax & __X32_SYSCALL_BIT);
 #else
-	return __NR_restart_syscall;
+	return __NR_restart_syscall2;
 #endif
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED
  2016-06-20 23:39 [PATCH v3 0/3] ptrace-vs-syscall-restart fixes, v3 Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
  2016-06-20 23:39 ` [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr Andy Lutomirski
@ 2016-06-20 23:39 ` Andy Lutomirski
  2016-06-23 21:26   ` Oleg Nesterov
  2 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:39 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Andy Lutomirski, Pedro Alves, Oleg Nesterov, Kees Cook

System call restart has some oddities wrt ptrace:

1. For whatever reason, the kernel delivers signals and triggers
   ptrace before handling syscall restart.  This means that
   -ERESTART_RESTARTBLOCK, etc is visible to userspace.  We could
   plausibly get away with changing that, but it seems quite risky.

2. As a result of (1), gdb (quite reasonably) expects that it can
   snapshot user state on signal delivery, adjust regs to call a
   function, and then restore user state.

3. Presumably as a result of (2), we do syscall restart if indicated
   by the register state on ptrace resume even if we're *not* resuming
   a syscall.

4. Also as a result of (1), gdb expects that writing -1 to orig_eax
   via POKEUSER or similar will *disable* syscall restart, which is
   necessary to get function calling on syscall exit to work.

The combination of (1) and (4) means that, if we have a 32-bit tracer,
we need to skip syscall restart if orig_eax == -1 (in a 32-bit signed
sense).  The combination of (1) and (2) means that, if we have a
32-bit tracer, we need to enable syscall restart if orig_eax > 0 (in a
32-bit signed sense) and eax contains a -ERESTART* code (again in a
signed sense).

The current state of affairs is a mess.  Setting a temporary per-task
flag when ptrace changes orig_eax is messy.  It does the wrong thing
when ptrace only writes eax.  It's also seriously overcomplicated IMO.

Instead, just unconditionally sign-extending them in the ptrace code
and not worrying about ptrace in the signal handling code.

Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c            |  6 +-----
 arch/x86/include/asm/syscall.h     |  2 +-
 arch/x86/include/asm/thread_info.h |  3 ---
 arch/x86/kernel/ptrace.c           | 37 +++++++++++++++++++++----------------
 4 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0db497a8ff19..ec138e538c44 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -270,12 +270,8 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * handling, because syscall restart has a fixup for compat
 	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
 	 * selftest.
-	 *
-	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
-	 * special case only applies after poking regs and before the
-	 * very next return to user mode.
 	 */
-	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
+	ti->status &= ~TS_COMPAT;
 #endif
 
 	user_enter();
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4e23dd15c661..4216bb7cbcba 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
 	 * TS_COMPAT is set for 32-bit syscall entries and then
 	 * remains set until we return to user mode.
 	 */
-	if (task_thread_info(task)->status & (TS_COMPAT|TS_I386_REGS_POKED))
+	if (task_thread_info(task)->status & TS_COMPAT)
 		/*
 		 * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
 		 * and will match correctly in comparisons.
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 4bca518d11f4..30c133ac05cd 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -228,9 +228,6 @@ static inline unsigned long current_stack_pointer(void)
  * have to worry about atomic accesses.
  */
 #define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
-#ifdef CONFIG_COMPAT
-#define TS_I386_REGS_POKED	0x0004	/* regs poked by 32-bit ptracer */
-#endif
 #define TS_RESTORE_SIGMASK	0x0008	/* restore signal mask in do_signal() */
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index f79576a541ff..c95aba795f88 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -891,6 +891,10 @@ long arch_ptrace(struct task_struct *child, long request,
 	case offsetof(struct user32, regs.l):				\
 		regs->q = value; break
 
+#define R32_SIGNED(l,q)							\
+	case offsetof(struct user32, regs.l):				\
+		regs->q = (long)(s32)value; break
+
 #define SEG32(rs)							\
 	case offsetof(struct user32, regs.rs):				\
 		return set_segment_reg(child,				\
@@ -917,25 +921,26 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 	R32(edi, di);
 	R32(esi, si);
 	R32(ebp, bp);
-	R32(eax, ax);
 	R32(eip, ip);
 	R32(esp, sp);
 
-	case offsetof(struct user32, regs.orig_eax):
-		/*
-		 * Warning: bizarre corner case fixup here.  A 32-bit
-		 * debugger setting orig_eax to -1 wants to disable
-		 * syscall restart.  Make sure that the syscall
-		 * restart code sign-extends orig_ax.  Also make sure
-		 * we interpret the -ERESTART* codes correctly if
-		 * loaded into regs->ax in case the task is not
-		 * actually still sitting at the exit from a 32-bit
-		 * syscall with TS_COMPAT still set.
-		 */
-		regs->orig_ax = value;
-		if (syscall_get_nr(child, regs) >= 0)
-			task_thread_info(child)->status |= TS_I386_REGS_POKED;
-		break;
+	/*
+	 * A 32-bit ptracer has the following expectations:
+	 *
+	 * - Storing -1 (i.e. 0xffffffff) to orig_eax will prevent
+	 *   syscall restart handling.
+	 *
+	 * - Restoring regs saved on exit from an interrupted
+	 *   restartable syscall will trigger syscall restart.  Such
+	 *   regs will have non-negative orig_eax and negative eax.
+	 *
+	 * The kernel's syscall restart code treats regs->orig_ax and
+	 * regs->ax as 64-bit signed quantities.  32-bit user code
+	 * doesn't care about the high bits.  Keep it simple and just
+	 * sign-extend both values.
+	 */
+	R32_SIGNED(orig_eax, orig_ax);
+	R32_SIGNED(eax, ax);
 
 	case offsetof(struct user32, regs.eflags):
 		return set_flags(child, value);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-20 23:39 ` [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr Andy Lutomirski
@ 2016-06-21 12:39   ` Pedro Alves
  2016-06-21 16:32     ` Andy Lutomirski
  2016-06-23 21:21   ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: Pedro Alves @ 2016-06-21 12:39 UTC (permalink / raw)
  To: Andy Lutomirski, x86, linux-kernel
  Cc: Borislav Petkov, Oleg Nesterov, Kees Cook

Hi Andy,

On 06/21/2016 12:39 AM, Andy Lutomirski wrote:
> Suppose a 64-bit task A traces a 32-bit task B.

I gave your x86/ptrace branch a try:

 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/ptrace

(this looks to be the same patch set.)

Unfortunately, with gdb git master, I still get the
64-bit ptracer x 32-bit ptracee problem:

 (gdb) r
 Starting program: interrupt.32 
 talk to me baby
 ^C

 Program received signal SIGINT, Interrupt.
 0xf7fd9d09 in __kernel_vsyscall ()
 (gdb) p func1 ()
 $1 = 4
 (gdb) c
 Continuing.
 Unknown error 512
 [Inferior 1 (process 2198) exited with code 01]
 (gdb) q

Is this expected?

This is the same testcase as before:

 https://sourceware.org/ml/gdb/2014-05/msg00004.html

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-21 12:39   ` Pedro Alves
@ 2016-06-21 16:32     ` Andy Lutomirski
  2016-06-22 12:00       ` Pedro Alves
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-21 16:32 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Oleg Nesterov, Kees Cook, Borislav Petkov, linux-kernel, X86 ML

On Jun 21, 2016 5:40 AM, "Pedro Alves" <palves@redhat.com> wrote:
>
> Hi Andy,
>
> On 06/21/2016 12:39 AM, Andy Lutomirski wrote:
> > Suppose a 64-bit task A traces a 32-bit task B.
>
> I gave your x86/ptrace branch a try:
>
>  https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/ptrace
>
> (this looks to be the same patch set.)
>
> Unfortunately, with gdb git master, I still get the
> 64-bit ptracer x 32-bit ptracee problem:
>
>  (gdb) r
>  Starting program: interrupt.32
>  talk to me baby
>  ^C

I didn't try that particular experiment.  But, from that email:

> After that, GDB can control the stopped inferior. To call function "func1()" of inferior, GDB need: Step 1, save current values of registers ($rax 0xfffffffffffffe00(64 bits -512) is cut to 0xfffffe00(32 bits -512) because inferior is a 32 bits program).

That sounds like it may be a gdb bug.  Why does gdb truncate the register?

I haven't played with it recently, but, in my experience, gdb seems to
work quite poorly in mixed-mode situations.  For example, if you
attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit
guest, and breakpoint in early 32-bit code, gdb tends to explode
pretty badly.

On x86_64, I think gdb should treat CPU state as 64-bit no matter
what.  The fact that a 32-bit tracee code segment is in use shouldn't
change much.

Admittedly the kernel doesn't really help.  There is some questionable
code involving which regsets to show to ptrace.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-21 16:32     ` Andy Lutomirski
@ 2016-06-22 12:00       ` Pedro Alves
  2016-06-22 15:20         ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Pedro Alves @ 2016-06-22 12:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Kees Cook, Borislav Petkov, linux-kernel, X86 ML

On 06/21/2016 05:32 PM, Andy Lutomirski wrote:
> On Jun 21, 2016 5:40 AM, "Pedro Alves" <palves@redhat.com> wrote:

> I didn't try that particular experiment.  But, from that email:
> 
>> After that, GDB can control the stopped inferior. To call function "func1()" of inferior, GDB need: Step 1, save current values of registers ($rax 0xfffffffffffffe00(64 bits -512) is cut to 0xfffffe00(32 bits -512) because inferior is a 32 bits program).
> 
> That sounds like it may be a gdb bug.  Why does gdb truncate the register?

Because when debugging a 32-bit program, gdb's register cache only
stores 32-bit-wide registers ($eax, $eip, etc., not $rax, etc.)

Let me turn this around:

Why does the kernel care about the upper 32-bit bits of $orig_rax when
the task is in 32-bit mode in the first place?

The 32-bit syscall entry points already only care about $eax, not $rax,
since $rax doesn't exist in real 32-bit CPUs.

Looking at arch/x86/entry/entry_64_compat.S, all three 64-bit x 32-bit syscall
entry points zero-extend $eax already, and then push that as pt_regs->orig_ax.

So if the kernel is giving significance to the higher 32-bits of orig_ax at
32-bit syscall restart time, that very much looks like a kernel bug to me.

> 
> I haven't played with it recently, but, in my experience, gdb seems to
> work quite poorly in mixed-mode situations.  For example, if you
> attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit
> guest, and breakpoint in early 32-bit code, gdb tends to explode
> pretty badly.

Right, but that's a bit of a red herring, and not entirely gdb's fault.  The case you
mention happens because qemu does exactly the opposite of what you're suggesting below.
It's qemu that changes the remote protocol's register layout (and thus size) in the
remote protocol register read/write packets when the kernel changes mode, behind
gdb's back, and gdb errors out because obviously it isn't expecting that.  All gdb
knows is that qemu is now sending bogus register read replies, different from what
the target description qemu reported on initial remote connection described.

I say "entirely", because gdb has its own share of fault for the remote protocol
not including a some kind of standard mechanism to inform gdb of mode changes.

However, the usual scenario where the program _doesn't_ change mode
during execution, is supported.

> 
> On x86_64, I think gdb should treat CPU state as 64-bit no matter
> what.  The fact that a 32-bit tracee code segment is in use shouldn't
> change much.

It's not as clear or easy as you make it sound, unfortunately.

For normal userspace programs, the current design across
gdb/remote protocol/ptrace/kernel/core dump format/elf formats/ is that
what matters is the program's architecture, not whatever the tracer's arch
is.

Should core dumping dump 64-bit CPU state as well for 32-bit programs?
The current core dump format dumps a 32-bit elf with notes that contain
32-bit registers.  And I think it'd be a bit odd for a 32-bit program to
dump different cores files depending on the bitness of the kernel.

Should a gdb connected to a 64-bit gdbserver that is debugging a 32-bit
program see different registers compared to a gdb that
is connected to a 32-bit gdbserver that is debugging a 32-bit program?
Currently, it doesn't.  The architecture of gdbserver doesn't matter here,
only the tracee's.

> Admittedly the kernel doesn't really help.  There is some questionable
> code involving which regsets to show to ptrace.

I don't know what code you're looking at, but I consider this mandatory reading:

 Roland McGrath on PTRACE_GETREGSET design:
 https://sourceware.org/ml/archer/2010-q3/msg00193.html

  "A caveat about those requests for bi-arch systems.  Unlike other
  ptrace requests, these access the native formats of the tracee
  process, rather than the native formats of the debugger process.
  So, a 64-bit debugger process using PTRACE_GETREGSET on a 32-bit
  tracee process will see the 32-bit layouts (i.e. what would appear
  in an ELF core file if that process dumped one)."


gdb currently uses PTRACE_SETREGS for the general registers, which
means it currently writes those as 64-bit registers.  However, if gdb or any
ptracer restores/writes $eax/$orig_eax using PTRACE_SETREGSET, it's only
going to pass down a 32-bit value, and again it must be the kernel that sign
extends $orig_eax if it wants to interpret it as signed 64-bit internally.


But actually looking at your patch 3, I'm confused, because it seems
to be doing what I'm suggesting?


Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-22 12:00       ` Pedro Alves
@ 2016-06-22 15:20         ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-22 15:20 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Oleg Nesterov, Kees Cook, Borislav Petkov, X86 ML, linux-kernel

On Jun 22, 2016 5:00 AM, "Pedro Alves" <palves@redhat.com> wrote:
>
> On 06/21/2016 05:32 PM, Andy Lutomirski wrote:
> > On Jun 21, 2016 5:40 AM, "Pedro Alves" <palves@redhat.com> wrote:
>
> > I didn't try that particular experiment.  But, from that email:
> >
> >> After that, GDB can control the stopped inferior. To call function "func1()" of inferior, GDB need: Step 1, save current values of registers ($rax 0xfffffffffffffe00(64 bits -512) is cut to 0xfffffe00(32 bits -512) because inferior is a 32 bits program).
> >
> > That sounds like it may be a gdb bug.  Why does gdb truncate the register?
>
> Because when debugging a 32-bit program, gdb's register cache only
> stores 32-bit-wide registers ($eax, $eip, etc., not $rax, etc.)
>
> Let me turn this around:
>
> Why does the kernel care about the upper 32-bit bits of $orig_rax when
> the task is in 32-bit mode in the first place?
>
> The 32-bit syscall entry points already only care about $eax, not $rax,
> since $rax doesn't exist in real 32-bit CPUs.

Two reasons.  First, long jump to 32-bit followed by long jump to
64-bit preserves all the high bits.  If there's a context switch or
similar in between, the high bits should still be preserved.  Ideally
gdb would preserve them, too -- this would make debugging weird
mode-switching programs much easier.

Second, when this save-and-restore dance happens, the kernel *doesn't
know* what the syscall bitness is, because it's not in a syscall
anymore.  The kernel can *guess* by checking CS or decoding the
syscall instruction, but it's just a guess -- decoding the syscall
instruction isn't absolutely guaranteed to work (UML, for example,
could unmap it temporarily), and the syscall could be int $0x80
(32-bit) even in a 64-bit code segment.

Part of the problem here is that there's a critical piece of state
that isn't visible directly through ptrace at all: where in the
syscall process the tracee is.  When PTRACE_SYSCALL fires the first
time, the tracee is in syscall entry.  No matter what regs are set,
the tracee will still be in syscall entry on resume.  Depending on
orig_ax, it may abort the syscall.  On the second event, it's in
syscall exit.  This is when you might read out an ERESTART code.  In
both of these syscall states, the kernel has an associated concept of
the syscall arch, and ptrace can neither read nor write it.  (The fact
that ptrace can't read it is why strace screws up completely if a
64-bit program uses int $0x80.)  On a signal event or a single-step
event, the tracee isn't in a syscall any more, so, when gdb shoves
syscall-like registers back into the tracee, the kernel has no direct
knowledge of whether the tracee is in a syscall or what arch it is, so
the kernel has to either assume that all regs are 64-bit (my preferred
approach) or needs to make a guess.

>
> Looking at arch/x86/entry/entry_64_compat.S, all three 64-bit x 32-bit syscall
> entry points zero-extend $eax already, and then push that as pt_regs->orig_ax.
>
> So if the kernel is giving significance to the higher 32-bits of orig_ax at
> 32-bit syscall restart time, that very much looks like a kernel bug to me.
>

It's not doing this.  When gdb does this reg restore, the kernel's not
doing 32-bit syscall restart -- it's doing a generic resume, and it
just happens that syscall restart has been (incorrectly!) run at this
stage in the resume process even when the task isn't returning from a
syscall.  Arguably the fact that orig_ax doesn't encode the syscall
bitness is a bug, but it's way too late to change it.

Could 64-bit gdb sign-extend eax and orig_eax when loading them if
it's forgotten their original high bits?

> >
> > I haven't played with it recently, but, in my experience, gdb seems to
> > work quite poorly in mixed-mode situations.  For example, if you
> > attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit
> > guest, and breakpoint in early 32-bit code, gdb tends to explode
> > pretty badly.
>
> Right, but that's a bit of a red herring, and not entirely gdb's fault.  The case you
> mention happens because qemu does exactly the opposite of what you're suggesting below.
> It's qemu that changes the remote protocol's register layout (and thus size) in the
> remote protocol register read/write packets when the kernel changes mode, behind
> gdb's back, and gdb errors out because obviously it isn't expecting that.  All gdb
> knows is that qemu is now sending bogus register read replies, different from what
> the target description qemu reported on initial remote connection described.
>

Yuck.

> I say "entirely", because gdb has its own share of fault for the remote protocol
> not including a some kind of standard mechanism to inform gdb of mode changes.

Is this something QEMU could improve?  I could try pestering them to fix it.

>
> However, the usual scenario where the program _doesn't_ change mode
> during execution, is supported.
>
> >
> > On x86_64, I think gdb should treat CPU state as 64-bit no matter
> > what.  The fact that a 32-bit tracee code segment is in use shouldn't
> > change much.
>
> It's not as clear or easy as you make it sound, unfortunately.
>
> For normal userspace programs, the current design across
> gdb/remote protocol/ptrace/kernel/core dump format/elf formats/ is that
> what matters is the program's architecture, not whatever the tracer's arch
> is.
>
> Should core dumping dump 64-bit CPU state as well for 32-bit programs?
> The current core dump format dumps a 32-bit elf with notes that contain
> 32-bit registers.  And I think it'd be a bit odd for a 32-bit program to
> dump different cores files depending on the bitness of the kernel.

I actually think it should, but only if the binary format would work
without breaking existing core file readers.

>
> Should a gdb connected to a 64-bit gdbserver that is debugging a 32-bit
> program see different registers compared to a gdb that
> is connected to a 32-bit gdbserver that is debugging a 32-bit program?
> Currently, it doesn't.  The architecture of gdbserver doesn't matter here,
> only the tracee's.

Yes, absolutely, at least if the user opts in.  This would make
debugging the kernel or other strange programs that use long jumps
much, much easier.

>
> > Admittedly the kernel doesn't really help.  There is some questionable
> > code involving which regsets to show to ptrace.
>
> I don't know what code you're looking at, but I consider this mandatory reading:
>
>  Roland McGrath on PTRACE_GETREGSET design:
>  https://sourceware.org/ml/archer/2010-q3/msg00193.html
>
>   "A caveat about those requests for bi-arch systems.  Unlike other
>   ptrace requests, these access the native formats of the tracee
>   process, rather than the native formats of the debugger process.
>   So, a 64-bit debugger process using PTRACE_GETREGSET on a 32-bit
>   tracee process will see the 32-bit layouts (i.e. what would appear
>   in an ELF core file if that process dumped one)."
>
>
> gdb currently uses PTRACE_SETREGS for the general registers, which
> means it currently writes those as 64-bit registers.  However, if gdb or any
> ptracer restores/writes $eax/$orig_eax using PTRACE_SETREGSET, it's only
> going to pass down a 32-bit value, and again it must be the kernel that sign
> extends $orig_eax if it wants to interpret it as signed 64-bit internally.
>

Egads!  Do you know whether this was intentional on the kernel's part?

I'm inclined to add a PTRACE_GETREGSET_CALLER_ABI or similar that
returns the regsets according to the tracer's view, thus allowing this
to actually work sanely.  (With the caveat that a 32-bit tracer may
have serious problems if the tracee is 64-bit.)

>
> But actually looking at your patch 3, I'm confused, because it seems
> to be doing what I'm suggesting?
>

The case that's still broken is syscall restart in general (the
decision to backtrack regs->ip), not the invocation of
restart_syscall.  The latter only happens for a small number of
syscalls.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
@ 2016-06-22 22:13   ` Oleg Nesterov
  2016-07-24 18:47   ` Andy Lutomirski
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2016-06-22 22:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Pedro Alves, Kees Cook

Andy, sorry for delay. And for the noise.

I just want to say that I'll try very much to read this series
tomorrow. I have some concerns at first glance... but I feel that
most probably this is only because I already need to sleep ;)

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
  2016-06-20 23:39 ` [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr Andy Lutomirski
  2016-06-21 12:39   ` Pedro Alves
@ 2016-06-23 21:21   ` Oleg Nesterov
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2016-06-23 21:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Pedro Alves, Kees Cook

So I think this series is fine, yesterday I misread it completely.

On 06/20, Andy Lutomirski wrote:
>
> Suppose a 64-bit task A traces a 32-bit task B.

And even if they are both 64-bit ...

> B makes a syscall
> that uses ERESTART_RESTARTBLOCK and gets a signal.  A catches
> syscall exit, snapshots B's regs, changes the regs, and resumes.
> Then A restores the snapshot of B's regs.

perhaps in this case gdb should always turn ERESTART_RESTARTBLOCK
into EINTR, because we can't know if B->restart_block is still the
same; it can be changed if the tracee does another RESTARTBLOCK
syscall after the first resume.

But anyway the patch looks good to me.

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED
  2016-06-20 23:39 ` [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED Andy Lutomirski
@ 2016-06-23 21:26   ` Oleg Nesterov
  2016-06-23 21:53     ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2016-06-23 21:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Pedro Alves, Kees Cook

Again, I think the patch is fine, just a question

On 06/20, Andy Lutomirski wrote:
>
> System call restart has some oddities wrt ptrace:
>
> 1. For whatever reason, the kernel delivers signals and triggers
>    ptrace before handling syscall restart.  This means that
>    -ERESTART_RESTARTBLOCK, etc is visible to userspace.  We could
>    plausibly get away with changing that, but it seems quite risky.

How we can change this?

The kernel simply can't know how it should react to (say) -ERESTARTSYS
until debugger acks/nacks/changes the signal reported by tracee.

> +	/*
> +	 * A 32-bit ptracer has the following expectations:
> +	 *
> +	 * - Storing -1 (i.e. 0xffffffff) to orig_eax will prevent
> +	 *   syscall restart handling.
> +	 *
> +	 * - Restoring regs saved on exit from an interrupted
> +	 *   restartable syscall will trigger syscall restart.  Such
> +	 *   regs will have non-negative orig_eax and negative eax.
> +	 *
> +	 * The kernel's syscall restart code treats regs->orig_ax and
> +	 * regs->ax as 64-bit signed quantities.  32-bit user code
> +	 * doesn't care about the high bits.  Keep it simple and just
> +	 * sign-extend both values.
> +	 */
> +	R32_SIGNED(orig_eax, orig_ax);
> +	R32_SIGNED(eax, ax);

OK. but do we really need R32_SIGNED(orig_eax) ? syscall_get_nr()
returns "int", not "long".

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED
  2016-06-23 21:26   ` Oleg Nesterov
@ 2016-06-23 21:53     ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-06-23 21:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Pedro Alves, Kees Cook

On Thu, Jun 23, 2016 at 2:26 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Again, I think the patch is fine, just a question
>
> On 06/20, Andy Lutomirski wrote:
>>
>> System call restart has some oddities wrt ptrace:
>>
>> 1. For whatever reason, the kernel delivers signals and triggers
>>    ptrace before handling syscall restart.  This means that
>>    -ERESTART_RESTARTBLOCK, etc is visible to userspace.  We could
>>    plausibly get away with changing that, but it seems quite risky.
>
> How we can change this?
>
> The kernel simply can't know how it should react to (say) -ERESTARTSYS
> until debugger acks/nacks/changes the signal reported by tracee.

Hmm, good point.  I don't know whether our current behavior is fully
correct or whether we could change it.

>
>> +     /*
>> +      * A 32-bit ptracer has the following expectations:
>> +      *
>> +      * - Storing -1 (i.e. 0xffffffff) to orig_eax will prevent
>> +      *   syscall restart handling.
>> +      *
>> +      * - Restoring regs saved on exit from an interrupted
>> +      *   restartable syscall will trigger syscall restart.  Such
>> +      *   regs will have non-negative orig_eax and negative eax.
>> +      *
>> +      * The kernel's syscall restart code treats regs->orig_ax and
>> +      * regs->ax as 64-bit signed quantities.  32-bit user code
>> +      * doesn't care about the high bits.  Keep it simple and just
>> +      * sign-extend both values.
>> +      */
>> +     R32_SIGNED(orig_eax, orig_ax);
>> +     R32_SIGNED(eax, ax);
>
> OK. but do we really need R32_SIGNED(orig_eax) ? syscall_get_nr()
> returns "int", not "long".

Fair enough, maybe we don't.  I'll drop that part and just keep
R32_SIGNED(eax, ax).

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
  2016-06-22 22:13   ` Oleg Nesterov
@ 2016-07-24 18:47   ` Andy Lutomirski
  2016-07-25  6:38     ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2016-07-24 18:47 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: X86 ML, linux-kernel, Borislav Petkov, Pedro Alves,
	Oleg Nesterov, Kees Cook

On Mon, Jun 20, 2016 at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Setting TS_COMPAT in ptrace is wrong: if we happen to do it during
> syscall entry, then we'll confuse seccomp and audit.  (The former
> isn't a security problem: seccomp is currently entirely insecure if a
> malicious ptracer is attached.)  As a minimal fix, this patch adds a
> new flag TS_I386_REGS_POKED that handles the ptrace special case.

Hi Ingo-

Could you apply this one patch for 4.8?  While I don't think it's a
significant security issue in 4.7 or earlier, leaving it unfixed in
4.8 will introduce a potentially unpleasant interaction with some
seccomp changes that are queued up in the
security tree for 4.8.

It will have a trivially-resolvable conflict with -mm.

The rest of the series this is in can wait.

--Andy

>
> Cc: Pedro Alves <palves@redhat.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/common.c            |  6 +++++-
>  arch/x86/include/asm/syscall.h     |  5 +----
>  arch/x86/include/asm/thread_info.h |  3 +++
>  arch/x86/kernel/ptrace.c           | 15 +++++++++------
>  arch/x86/kernel/signal.c           | 26 ++++++++++++++++++++++++--
>  5 files changed, 42 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index ec138e538c44..0db497a8ff19 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -270,8 +270,12 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>          * handling, because syscall restart has a fixup for compat
>          * syscalls.  The fixup is exercised by the ptrace_syscall_32
>          * selftest.
> +        *
> +        * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
> +        * special case only applies after poking regs and before the
> +        * very next return to user mode.
>          */
> -       ti->status &= ~TS_COMPAT;
> +       ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
>  #endif
>
>         user_enter();
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 999b7cd2e78c..4e23dd15c661 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
>          * TS_COMPAT is set for 32-bit syscall entries and then
>          * remains set until we return to user mode.
>          */
> -       if (task_thread_info(task)->status & TS_COMPAT)
> +       if (task_thread_info(task)->status & (TS_COMPAT|TS_I386_REGS_POKED))
>                 /*
>                  * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
>                  * and will match correctly in comparisons.
> @@ -239,9 +239,6 @@ static inline int syscall_get_arch(void)
>          * TS_COMPAT is set for 32-bit syscall entry and then
>          * remains set until we return to user mode.
>          *
> -        * TIF_IA32 tasks should always have TS_COMPAT set at
> -        * system call time.
> -        *
>          * x32 tasks should be considered AUDIT_ARCH_X86_64.
>          */
>         if (task_thread_info(current)->status & TS_COMPAT)
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 30c133ac05cd..4bca518d11f4 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -228,6 +228,9 @@ static inline unsigned long current_stack_pointer(void)
>   * have to worry about atomic accesses.
>   */
>  #define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
> +#ifdef CONFIG_COMPAT
> +#define TS_I386_REGS_POKED     0x0004  /* regs poked by 32-bit ptracer */
> +#endif
>  #define TS_RESTORE_SIGMASK     0x0008  /* restore signal mask in do_signal() */
>
>  #ifndef __ASSEMBLY__
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 600edd225e81..f79576a541ff 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -923,15 +923,18 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
>
>         case offsetof(struct user32, regs.orig_eax):
>                 /*
> -                * A 32-bit debugger setting orig_eax means to restore
> -                * the state of the task restarting a 32-bit syscall.
> -                * Make sure we interpret the -ERESTART* codes correctly
> -                * in case the task is not actually still sitting at the
> -                * exit from a 32-bit syscall with TS_COMPAT still set.
> +                * Warning: bizarre corner case fixup here.  A 32-bit
> +                * debugger setting orig_eax to -1 wants to disable
> +                * syscall restart.  Make sure that the syscall
> +                * restart code sign-extends orig_ax.  Also make sure
> +                * we interpret the -ERESTART* codes correctly if
> +                * loaded into regs->ax in case the task is not
> +                * actually still sitting at the exit from a 32-bit
> +                * syscall with TS_COMPAT still set.
>                  */
>                 regs->orig_ax = value;
>                 if (syscall_get_nr(child, regs) >= 0)
> -                       task_thread_info(child)->status |= TS_COMPAT;
> +                       task_thread_info(child)->status |= TS_I386_REGS_POKED;
>                 break;
>
>         case offsetof(struct user32, regs.eflags):
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 22cc2f9f8aec..6b952e1d8db8 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -760,8 +760,30 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
>
>  static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
>  {
> -#ifdef CONFIG_X86_64
> -       if (in_ia32_syscall())
> +       /*
> +        * This function is fundamentally broken as currently
> +        * implemented.
> +        *
> +        * The idea is that we want to trigger a call to the
> +        * restart_block() syscall and that we want in_ia32_syscall(),
> +        * in_x32_syscall(), etc.  to match whatever they were in the
> +        * syscall being restarted.  We assume that the syscall
> +        * instruction at (regs->ip - 2) matches whatever syscall
> +        * instruction we used to enter in the first place.
> +        *
> +        * The problem is that we can get here when ptrace pokes
> +        * syscall-like values into regs even if we're not in a syscall
> +        * at all.
> +        *
> +        * For now, we maintain historical behavior and guess based on
> +        * stored state.  We could do better by saving the actual
> +        * syscall arch in restart_block or (with caveats on x32) by
> +        * checking if regs->ip points to 'int $0x80'.  The current
> +        * behavior is incorrect if a tracer has a different bitness
> +        * than the tracee.
> +        */
> +#ifdef CONFIG_IA32_EMULATION
> +       if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
>                 return __NR_ia32_restart_syscall;
>  #endif
>  #ifdef CONFIG_X86_X32_ABI
> --
> 2.5.5
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-07-24 18:47   ` Andy Lutomirski
@ 2016-07-25  6:38     ` Ingo Molnar
  2016-07-25 16:38       ` Oleg Nesterov
  2016-07-25 16:57       ` Andy Lutomirski
  0 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2016-07-25  6:38 UTC (permalink / raw)
  To: Andy Lutomirski, Oleg Nesterov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Pedro Alves, Oleg Nesterov, Kees Cook


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Mon, Jun 20, 2016 at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > Setting TS_COMPAT in ptrace is wrong: if we happen to do it during
> > syscall entry, then we'll confuse seccomp and audit.  (The former
> > isn't a security problem: seccomp is currently entirely insecure if a
> > malicious ptracer is attached.)  As a minimal fix, this patch adds a
> > new flag TS_I386_REGS_POKED that handles the ptrace special case.
> 
> Hi Ingo-
> 
> Could you apply this one patch for 4.8?  While I don't think it's a
> significant security issue in 4.7 or earlier, leaving it unfixed in
> 4.8 will introduce a potentially unpleasant interaction with some
> seccomp changes that are queued up in the
> security tree for 4.8.
> 
> It will have a trivially-resolvable conflict with -mm.
> 
> The rest of the series this is in can wait.

I don't mind the rest of the series either - could you please repost it (with the 
review feedback addressed)?

Looks like that with minor changes the series has Oleg's Acked-by?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-07-25  6:38     ` Ingo Molnar
@ 2016-07-25 16:38       ` Oleg Nesterov
  2016-07-25 16:57       ` Andy Lutomirski
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2016-07-25 16:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, linux-kernel,
	Borislav Petkov, Pedro Alves, Kees Cook

On 07/25, Ingo Molnar wrote:
>
> Looks like that with minor changes the series has Oleg's Acked-by?

Yes, thanks, I think these changes are fine.

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-07-25  6:38     ` Ingo Molnar
  2016-07-25 16:38       ` Oleg Nesterov
@ 2016-07-25 16:57       ` Andy Lutomirski
  2016-07-26  0:21         ` Andy Lutomirski
  1 sibling, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2016-07-25 16:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pedro Alves, Oleg Nesterov, Kees Cook, Borislav Petkov,
	linux-kernel, X86 ML

On Jul 24, 2016 11:38 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > On Mon, Jun 20, 2016 at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > Setting TS_COMPAT in ptrace is wrong: if we happen to do it during
> > > syscall entry, then we'll confuse seccomp and audit.  (The former
> > > isn't a security problem: seccomp is currently entirely insecure if a
> > > malicious ptracer is attached.)  As a minimal fix, this patch adds a
> > > new flag TS_I386_REGS_POKED that handles the ptrace special case.
> >
> > Hi Ingo-
> >
> > Could you apply this one patch for 4.8?  While I don't think it's a
> > significant security issue in 4.7 or earlier, leaving it unfixed in
> > 4.8 will introduce a potentially unpleasant interaction with some
> > seccomp changes that are queued up in the
> > security tree for 4.8.
> >
> > It will have a trivially-resolvable conflict with -mm.
> >
> > The rest of the series this is in can wait.
>
> I don't mind the rest of the series either - could you please repost it (with the
> review feedback addressed)?

I'm nervous about it for a couple reasons involving the fact that it's
user visible.

1. It doesn't make gdb work right in all the cases that gdb currently
gets wrong.  I haven't had time to think about whether there's a
minimal tweak that would fix this.

2. It might have annoying interactions with seccomp whitelists.  I
don't know that for sure, but I still don't love it.


Patch 1 is only user-visible in the case where the current behavior is
clearly wrong, so I'd personally be more comfortable applying just
patch 1 for 4.8.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code
  2016-07-25 16:57       ` Andy Lutomirski
@ 2016-07-26  0:21         ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-07-26  0:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pedro Alves, Oleg Nesterov, Kees Cook, Borislav Petkov,
	linux-kernel, X86 ML

On Mon, Jul 25, 2016 at 9:57 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Jul 24, 2016 11:38 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>>
>>
>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> > On Mon, Jun 20, 2016 at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> > > Setting TS_COMPAT in ptrace is wrong: if we happen to do it during
>> > > syscall entry, then we'll confuse seccomp and audit.  (The former
>> > > isn't a security problem: seccomp is currently entirely insecure if a
>> > > malicious ptracer is attached.)  As a minimal fix, this patch adds a
>> > > new flag TS_I386_REGS_POKED that handles the ptrace special case.
>> >
>> > Hi Ingo-
>> >
>> > Could you apply this one patch for 4.8?  While I don't think it's a
>> > significant security issue in 4.7 or earlier, leaving it unfixed in
>> > 4.8 will introduce a potentially unpleasant interaction with some
>> > seccomp changes that are queued up in the
>> > security tree for 4.8.
>> >
>> > It will have a trivially-resolvable conflict with -mm.
>> >
>> > The rest of the series this is in can wait.
>>
>> I don't mind the rest of the series either - could you please repost it (with the
>> review feedback addressed)?
>
> I'm nervous about it for a couple reasons involving the fact that it's
> user visible.
>
> 1. It doesn't make gdb work right in all the cases that gdb currently
> gets wrong.  I haven't had time to think about whether there's a
> minimal tweak that would fix this.

After re-reading the whole thread, I think that the rest of the series
needs a good self-test to make sure that we're providing whatever
behavior we think we're providing.  So I really don't what to apply it
yet.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-07-26  0:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-20 23:39 [PATCH v3 0/3] ptrace-vs-syscall-restart fixes, v3 Andy Lutomirski
2016-06-20 23:39 ` [PATCH v3 1/3] x86/ptrace: Stop setting TS_COMPAT in ptrace code Andy Lutomirski
2016-06-22 22:13   ` Oleg Nesterov
2016-07-24 18:47   ` Andy Lutomirski
2016-07-25  6:38     ` Ingo Molnar
2016-07-25 16:38       ` Oleg Nesterov
2016-07-25 16:57       ` Andy Lutomirski
2016-07-26  0:21         ` Andy Lutomirski
2016-06-20 23:39 ` [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr Andy Lutomirski
2016-06-21 12:39   ` Pedro Alves
2016-06-21 16:32     ` Andy Lutomirski
2016-06-22 12:00       ` Pedro Alves
2016-06-22 15:20         ` Andy Lutomirski
2016-06-23 21:21   ` Oleg Nesterov
2016-06-20 23:39 ` [PATCH v3 3/3] x86/ptrace, x86/signal: Remove TS_I386_REGS_POKED Andy Lutomirski
2016-06-23 21:26   ` Oleg Nesterov
2016-06-23 21:53     ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).