[PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack)
       [not found]                 ` <CA+55aFy=newnMbx53HipyWbRs2mUUPSqXXCpSfDLW78gkro37g@mail.gmail.com>
@ 2013-01-20 19:24                   ` Oleg Nesterov
  2013-01-20 19:25                     ` [PATCH 1/4] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
                                       ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

add lkml/cc's.

On 01/18, Linus Torvalds wrote:
>
> On Fri, Jan 18, 2013 at 10:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Or we can do this after wait_task_inactive() but then we need to take
> > ->siglock again.
>
> Yes. We absolutely need siglock, since that would be exactly what
> would protect us against signal_wake_up() (which is, I *think* the
> only thing that can ever wake up the TASK_TRACED/WAKEKILL cases).

Yes. And thus 4/4 probably should be 1/4.

> And we'd need to make sure to re-set the WAKEKILL flag not just in all
> the callers of ptrace_check_attach(), but also in the failure case of
> wait_task_inactive(). I'm not sure it can actually fail if we cleared
> WAKEKILL, but it's all pretty subtle.

Afaics it can't fail if we clear WAKEKILL... So 2/4 assumes it should
always succeed and adds the warning.

> And when we *do* set the WAKEKILL bit again, we should make sure to
> wake the task in case the killable signal happened while it was clear.

Yes, yes, this is clear. And we need to ensure we can not race with
attach-after-detach...

> And I agree that this is all pretty scary and generally playing with
> another process' 'flags' field is some really nasty business. So I'm a
> bit worried about it.

Oh yes. And I was going to argue that (a much simpler) change which
doesn't allow the tracee to return from ptrace_stop() is better. But
then I recalled about set_task_blockstep() and changed my mind (see
the changelog in 2/4).

Greg, this doesn't look like -stable material. But please let me know
if you think 2/4 should be backported. With a couple of simple hacks
in PTRACE_DETACH/LISTEN paths we can do this without 1/4 and without
changes outside of ptrace.c. But again, probably we shouldn't do this.

Please review.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
@ 2013-01-20 19:25                     ` Oleg Nesterov
  2013-01-20 19:25                     ` [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
                                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

Cleanup and preparation for the next change.

signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
necessary mask.

Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
which adds __TASK_TRACED.

This way ptrace_signal_wake_up() can work "inside" ptrace_request()
even if the tracee doesn't have the TASK_WAKEKILL bit set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched.h |   11 ++++++++++-
 kernel/ptrace.c       |    8 ++++----
 kernel/signal.c       |   14 ++++----------
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..48b4151 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2713,7 +2713,16 @@ static inline void thread_group_cputime_init(struct signal_struct *sig)
 extern void recalc_sigpending_and_wake(struct task_struct *t);
 extern void recalc_sigpending(void);
 
-extern void signal_wake_up(struct task_struct *t, int resume_stopped);
+extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
+
+static inline void signal_wake_up(struct task_struct *t, bool resume)
+{
+	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
+}
+static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
+{
+	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
+}
 
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1599157..74ebdec 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -117,7 +117,7 @@ void __ptrace_unlink(struct task_struct *child)
 	 * TASK_KILLABLE sleeps.
 	 */
 	if (child->jobctl & JOBCTL_STOP_PENDING || task_is_traced(child))
-		signal_wake_up(child, task_is_traced(child));
+		ptrace_signal_wake_up(child, task_is_traced(child));
 
 	spin_unlock(&child->sighand->siglock);
 }
@@ -317,7 +317,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 */
 	if (task_is_stopped(task) &&
 	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING))
-		signal_wake_up(task, 1);
+		signal_wake_up_state(task, __TASK_STOPPED);
 
 	spin_unlock(&task->sighand->siglock);
 
@@ -737,7 +737,7 @@ int ptrace_request(struct task_struct *child, long request,
 		 * tracee into STOP.
 		 */
 		if (likely(task_set_jobctl_pending(child, JOBCTL_TRAP_STOP)))
-			signal_wake_up(child, child->jobctl & JOBCTL_LISTENING);
+			ptrace_signal_wake_up(child, child->jobctl & JOBCTL_LISTENING);
 
 		unlock_task_sighand(child, &flags);
 		ret = 0;
@@ -763,7 +763,7 @@ int ptrace_request(struct task_struct *child, long request,
 			 * start of this trap and now.  Trigger re-trap.
 			 */
 			if (child->jobctl & JOBCTL_TRAP_NOTIFY)
-				signal_wake_up(child, true);
+				ptrace_signal_wake_up(child, true);
 			ret = 0;
 		}
 		unlock_task_sighand(child, &flags);
diff --git a/kernel/signal.c b/kernel/signal.c
index 372771e..2bf2731 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -680,23 +680,17 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  * No need to set need_resched since signal event passing
  * goes through ->blocked
  */
-void signal_wake_up(struct task_struct *t, int resume)
+void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
-	unsigned int mask;
-
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
-
 	/*
-	 * For SIGKILL, we want to wake it up in the stopped/traced/killable
+	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
 	 * case. We don't check t->state here because there is a race with it
 	 * executing another processor and just now entering stopped state.
 	 * By using wake_up_state, we ensure the process will wake up and
 	 * handle its death signal.
 	 */
-	mask = TASK_INTERRUPTIBLE;
-	if (resume)
-		mask |= TASK_WAKEKILL;
-	if (!wake_up_state(t, mask))
+	if (!wake_up_state(t, state | TASK_INTERRUPTIBLE))
 		kick_process(t);
 }
 
@@ -844,7 +838,7 @@ static void ptrace_trap_notify(struct task_struct *t)
 	assert_spin_locked(&t->sighand->siglock);
 
 	task_set_jobctl_pending(t, JOBCTL_TRAP_NOTIFY);
-	signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
+	ptrace_signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
 }
 
 /*
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
  2013-01-20 19:25                     ` [PATCH 1/4] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
@ 2013-01-20 19:25                     ` Oleg Nesterov
  2013-01-20 19:46                       ` [PATCH v2 " Oleg Nesterov
  2013-01-20 20:21                       ` [PATCH " Linus Torvalds
  2013-01-20 19:25                     ` [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach() Oleg Nesterov
                                       ` (3 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

putreg() assumes that the tracee is not running and pt_regs_access()
can safely play with its stack. However a killed tracee can return
from ptrace_stop() to the low-level asm code and do RESTORE_REST,
this means that debugger can actually read/modify the kernel stack
until the tracee does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense
this race is even worse, the very fact the tracee can be woken up
breaks the logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it. Not only this fixes the mentioned problems,
we can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example
it makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/step.c |    9 ++++---
 kernel/ptrace.c        |   54 ++++++++++++++++++++++++++++++++++++++++--------
 kernel/signal.c        |    5 ++++
 3 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index cd3b243..9b4d51d 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -165,10 +165,11 @@ void set_task_blockstep(struct task_struct *task, bool on)
 	 * Ensure irq/preemption can't change debugctl in between.
 	 * Note also that both TIF_BLOCKSTEP and debugctl should
 	 * be changed atomically wrt preemption.
-	 * FIXME: this means that set/clear TIF_BLOCKSTEP is simply
-	 * wrong if task != current, SIGKILL can wakeup the stopped
-	 * tracee and set/clear can play with the running task, this
-	 * can confuse the next __switch_to_xtra().
+	 *
+	 * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if
+	 * task is current or it can't be running, otherwise we can race
+	 * with __switch_to_xtra(). We rely on ptrace_freeze_traced() but
+	 * PTRACE_KILL is not safe.
 	 */
 	local_irq_disable();
 	debugctl = get_debugctlmsr();
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 74ebdec..878ef11 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -122,6 +122,41 @@ void __ptrace_unlink(struct task_struct *child)
 	spin_unlock(&child->sighand->siglock);
 }
 
+/* Ensure that nothing can wake it up, even SIGKILL */
+static bool ptrace_freeze_traced(struct task_struct *task)
+{
+	bool ret = false;
+
+	/* Lockless, nobody but us can set this flag */
+	if (task->jobctl & JOBCTL_LISTENING)
+		return ret;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+		task->state = __TASK_TRACED;
+		ret = true;
+	}
+	spin_unlock_irq(&task->sighand->siglock);
+
+	return ret;
+}
+
+static void ptrace_unfreeze_traced(struct task_struct *task)
+{
+	if (task->state != __TASK_TRACED)
+		return;
+
+	if (WARN_ON(!task->ptrace || task->parent != current))
+		return;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (__fatal_signal_pending(task))
+		wake_up_state(task, __TASK_TRACED);
+	else
+		task->state = TASK_TRACED;
+	spin_unlock_irq(&task->sighand->siglock);
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -151,24 +186,20 @@ int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 * be changed by us so it's not changing right after this.
 	 */
 	read_lock(&tasklist_lock);
-	if ((child->ptrace & PT_PTRACED) && child->parent == current) {
+	if (child->ptrace && child->parent == current) {
+		WARN_ON(child->state == __TASK_TRACED);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		spin_lock_irq(&child->sighand->siglock);
-		WARN_ON_ONCE(task_is_stopped(child));
-		if (ignore_state || (task_is_traced(child) &&
-				     !(child->jobctl & JOBCTL_LISTENING)))
+		if (ignore_state || ptrace_freeze_traced(child))
 			ret = 0;
-		spin_unlock_irq(&child->sighand->siglock);
 	}
 	read_unlock(&tasklist_lock);
 
 	if (!ret && !ignore_state)
-		ret = wait_task_inactive(child, TASK_TRACED) ? 0 : -ESRCH;
+		WARN_ON(!wait_task_inactive(child, __TASK_TRACED));
 
-	/* All systems go.. */
 	return ret;
 }
 
@@ -900,6 +931,8 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
+	if (request != PTRACE_DETACH)
+		ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1039,8 +1072,11 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 
 	ret = ptrace_check_attach(child, request == PTRACE_KILL ||
 				  request == PTRACE_INTERRUPT);
-	if (!ret)
+	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
+		if (request != PTRACE_DETACH)
+			ptrace_unfreeze_traced(child);
+	}
 
  out_put_task_struct:
 	put_task_struct(child);
diff --git a/kernel/signal.c b/kernel/signal.c
index 2bf2731..980bcf3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1794,6 +1794,10 @@ static inline int may_ptrace_stop(void)
 	 * If SIGKILL was already sent before the caller unlocked
 	 * ->siglock we must see ->core_state != NULL. Otherwise it
 	 * is safe to enter schedule().
+	 *
+	 * This is almost outdated, a task with the pending SIGKILL can't
+	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
+	 * after SIGKILL was already dequeued.
 	 */
 	if (unlikely(current->mm->core_state) &&
 	    unlikely(current->mm == current->parent->mm))
@@ -1919,6 +1923,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		if (clear_code)
 			current->exit_code = 0;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach()
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
  2013-01-20 19:25                     ` [PATCH 1/4] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
  2013-01-20 19:25                     ` [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
@ 2013-01-20 19:25                     ` Oleg Nesterov
  2013-01-20 20:23                       ` Linus Torvalds
  2013-01-20 19:26                     ` [PATCH 4/4] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
                                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

arch/ia64/kernel/ptrace.c:thread_matches() has no users since e868a55c
"[IA64] remove find_thread_for_addr()", remove it.

And this allows us to unexport ptrace_check_attach().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/ia64/kernel/ptrace.c |   27 ---------------------------
 include/linux/ptrace.h    |    1 -
 kernel/ptrace.c           |    2 +-
 3 files changed, 1 insertions(+), 29 deletions(-)

diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
index 4265ff6..b7a5fff 100644
--- a/arch/ia64/kernel/ptrace.c
+++ b/arch/ia64/kernel/ptrace.c
@@ -672,33 +672,6 @@ ptrace_attach_sync_user_rbs (struct task_struct *child)
 	read_unlock(&tasklist_lock);
 }
 
-static inline int
-thread_matches (struct task_struct *thread, unsigned long addr)
-{
-	unsigned long thread_rbs_end;
-	struct pt_regs *thread_regs;
-
-	if (ptrace_check_attach(thread, 0) < 0)
-		/*
-		 * If the thread is not in an attachable state, we'll
-		 * ignore it.  The net effect is that if ADDR happens
-		 * to overlap with the portion of the thread's
-		 * register backing store that is currently residing
-		 * on the thread's kernel stack, then ptrace() may end
-		 * up accessing a stale value.  But if the thread
-		 * isn't stopped, that's a problem anyhow, so we're
-		 * doing as well as we can...
-		 */
-		return 0;
-
-	thread_regs = task_pt_regs(thread);
-	thread_rbs_end = ia64_get_user_rbs_end(thread, thread_regs, NULL);
-	if (!on_kernel_rbs(addr, thread_regs->ar_bspstore, thread_rbs_end))
-		return 0;
-
-	return 1;	/* looks like we've got a winner */
-}
-
 /*
  * Write f32-f127 back to task->thread.fph if it has been modified.
  */
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 1693775..89573a3 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -45,7 +45,6 @@ extern long arch_ptrace(struct task_struct *child, long request,
 extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len);
 extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len);
 extern void ptrace_disable(struct task_struct *);
-extern int ptrace_check_attach(struct task_struct *task, bool ignore_state);
 extern int ptrace_request(struct task_struct *child, long request,
 			  unsigned long addr, unsigned long data);
 extern void ptrace_notify(int exit_code);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 425ff24..6cfa950 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -174,7 +174,7 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
  * RETURNS:
  * 0 on success, -ESRCH if %child is not ready.
  */
-int ptrace_check_attach(struct task_struct *child, bool ignore_state)
+static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
 	int ret = -ESRCH;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/4] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
                                       ` (2 preceding siblings ...)
  2013-01-20 19:25                     ` [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach() Oleg Nesterov
@ 2013-01-20 19:26                     ` Oleg Nesterov
  2013-01-20 19:35                     ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
  2013-01-23 18:00                     ` Greg Kroah-Hartman
  5 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
Change it to use TASK_NORMAL and add the WARN_ON().

TASK_ALL has no other users, probably can be killed.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/core.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..26058d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1523,7 +1523,8 @@ out:
  */
 int wake_up_process(struct task_struct *p)
 {
-	return try_to_wake_up(p, TASK_ALL, 0);
+	WARN_ON(task_is_stopped_or_traced(p));
+	return try_to_wake_up(p, TASK_NORMAL, 0);
 }
 EXPORT_SYMBOL(wake_up_process);
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack)
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
                                       ` (3 preceding siblings ...)
  2013-01-20 19:26                     ` [PATCH 4/4] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
@ 2013-01-20 19:35                     ` Oleg Nesterov
  2013-01-23 18:00                     ` Greg Kroah-Hartman
  5 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

On 01/20, Oleg Nesterov wrote:
>
> > And we'd need to make sure to re-set the WAKEKILL flag not just in all
> > the callers of ptrace_check_attach(), but also in the failure case of
> > wait_task_inactive(). I'm not sure it can actually fail if we cleared
> > WAKEKILL, but it's all pretty subtle.
>
> Afaics it can't fail if we clear WAKEKILL... So 2/4 assumes it should
> always succeed and adds the warning.

Argh! It can, exactly because we can not kill may_ptrace_stop() yet.
I'll update and resend 2/4.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-20 19:25                     ` [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
@ 2013-01-20 19:46                       ` Oleg Nesterov
  2013-01-20 20:21                       ` [PATCH " Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-20 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman

putreg() assumes that the tracee is not running and pt_regs_access()
can safely play with its stack. However a killed tracee can return
from ptrace_stop() to the low-level asm code and do RESTORE_REST,
this means that debugger can actually read/modify the kernel stack
until the tracee does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense
this race is even worse, the very fact the tracee can be woken up
breaks the logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it. Not only this fixes the mentioned problems,
we can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example
it makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/step.c |    9 ++++---
 kernel/ptrace.c        |   61 ++++++++++++++++++++++++++++++++++++++++--------
 kernel/signal.c        |    5 ++++
 3 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index cd3b243..9b4d51d 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -165,10 +165,11 @@ void set_task_blockstep(struct task_struct *task, bool on)
 	 * Ensure irq/preemption can't change debugctl in between.
 	 * Note also that both TIF_BLOCKSTEP and debugctl should
 	 * be changed atomically wrt preemption.
-	 * FIXME: this means that set/clear TIF_BLOCKSTEP is simply
-	 * wrong if task != current, SIGKILL can wakeup the stopped
-	 * tracee and set/clear can play with the running task, this
-	 * can confuse the next __switch_to_xtra().
+	 *
+	 * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if
+	 * task is current or it can't be running, otherwise we can race
+	 * with __switch_to_xtra(). We rely on ptrace_freeze_traced() but
+	 * PTRACE_KILL is not safe.
 	 */
 	local_irq_disable();
 	debugctl = get_debugctlmsr();
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 74ebdec..c68e705 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -122,6 +122,41 @@ void __ptrace_unlink(struct task_struct *child)
 	spin_unlock(&child->sighand->siglock);
 }
 
+/* Ensure that nothing can wake it up, even SIGKILL */
+static bool ptrace_freeze_traced(struct task_struct *task)
+{
+	bool ret = false;
+
+	/* Lockless, nobody but us can set this flag */
+	if (task->jobctl & JOBCTL_LISTENING)
+		return ret;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+		task->state = __TASK_TRACED;
+		ret = true;
+	}
+	spin_unlock_irq(&task->sighand->siglock);
+
+	return ret;
+}
+
+static void ptrace_unfreeze_traced(struct task_struct *task)
+{
+	if (task->state != __TASK_TRACED)
+		return;
+
+	if (WARN_ON(!task->ptrace || task->parent != current))
+		return;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (__fatal_signal_pending(task))
+		wake_up_state(task, __TASK_TRACED);
+	else
+		task->state = TASK_TRACED;
+	spin_unlock_irq(&task->sighand->siglock);
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -151,24 +186,25 @@ int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 * be changed by us so it's not changing right after this.
 	 */
 	read_lock(&tasklist_lock);
-	if ((child->ptrace & PT_PTRACED) && child->parent == current) {
+	if (child->ptrace && child->parent == current) {
+		WARN_ON(child->state == __TASK_TRACED);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		spin_lock_irq(&child->sighand->siglock);
-		WARN_ON_ONCE(task_is_stopped(child));
-		if (ignore_state || (task_is_traced(child) &&
-				     !(child->jobctl & JOBCTL_LISTENING)))
+		if (ignore_state || ptrace_freeze_traced(child))
 			ret = 0;
-		spin_unlock_irq(&child->sighand->siglock);
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state)
-		ret = wait_task_inactive(child, TASK_TRACED) ? 0 : -ESRCH;
+	if (!ret && !ignore_state) {
+		if (!wait_task_inactive(child, __TASK_TRACED)) {
+			/* This can only happen if may_ptrace_stop() fails */
+			WARN_ON(child->state == __TASK_TRACED);
+			ret = -ESRCH;
+		}
+	}
 
-	/* All systems go.. */
 	return ret;
 }
 
@@ -900,6 +936,8 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
+	if (request != PTRACE_DETACH)
+		ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1039,8 +1077,11 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 
 	ret = ptrace_check_attach(child, request == PTRACE_KILL ||
 				  request == PTRACE_INTERRUPT);
-	if (!ret)
+	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
+		if (request != PTRACE_DETACH)
+			ptrace_unfreeze_traced(child);
+	}
 
  out_put_task_struct:
 	put_task_struct(child);
diff --git a/kernel/signal.c b/kernel/signal.c
index 2bf2731..980bcf3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1794,6 +1794,10 @@ static inline int may_ptrace_stop(void)
 	 * If SIGKILL was already sent before the caller unlocked
 	 * ->siglock we must see ->core_state != NULL. Otherwise it
 	 * is safe to enter schedule().
+	 *
+	 * This is almost outdated, a task with the pending SIGKILL can't
+	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
+	 * after SIGKILL was already dequeued.
 	 */
 	if (unlikely(current->mm->core_state) &&
 	    unlikely(current->mm == current->parent->mm))
@@ -1919,6 +1923,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		if (clear_code)
 			current->exit_code = 0;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-20 19:25                     ` [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
  2013-01-20 19:46                       ` [PATCH v2 " Oleg Nesterov
@ 2013-01-20 20:21                       ` Linus Torvalds
  2013-01-21 17:21                         ` Oleg Nesterov
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2013-01-20 20:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On Sun, Jan 20, 2013 at 11:25 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> +
> +static void ptrace_unfreeze_traced(struct task_struct *task)
> +{
> +       if (task->state != __TASK_TRACED)
> +               return;
> +
> +       if (WARN_ON(!task->ptrace || task->parent != current))
> +               return;

This WARN_ON() is bogus.

Because you added the warning, you then need to make the callers check
for the whole PTRACE_UNATTACH.

So I think you should just remove the WARN_ON(), and then just call
ptrace_unfreeze_traced() unconditionally after you've successfully
done a ptrace_check_attach(). Just to keep the code simpler.

Also, in your corrected version, you had

+               if (!wait_task_inactive(child, __TASK_TRACED)) {
+                       /* This can only happen if may_ptrace_stop() fails */
+                       WARN_ON(child->state == __TASK_TRACED);
+                       ret = -ESRCH;

where I actually think that the comment is not really helpful. It
doesn't really explain what he child can do to get to ptrace_stop() in
the first place, and what that does to the child state...

               Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach()
  2013-01-20 19:25                     ` [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach() Oleg Nesterov
@ 2013-01-20 20:23                       ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2013-01-20 20:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On Sun, Jan 20, 2013 at 11:25 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> arch/ia64/kernel/ptrace.c:thread_matches() has no users since e868a55c
> "[IA64] remove find_thread_for_addr()", remove it.

I think this one should go first, so that there are no stale callers
of ptrace_check_attach() when you change the semantics. In fact, I'll
apply it immediately, since it's clearly correct and independent of
the other changes.

              Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-20 20:21                       ` [PATCH " Linus Torvalds
@ 2013-01-21 17:21                         ` Oleg Nesterov
  2013-01-21 18:27                           ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-21 17:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On 01/20, Linus Torvalds wrote:
>
> On Sun, Jan 20, 2013 at 11:25 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +
> > +static void ptrace_unfreeze_traced(struct task_struct *task)
> > +{
> > +       if (task->state != __TASK_TRACED)
> > +               return;
> > +
> > +       if (WARN_ON(!task->ptrace || task->parent != current))
> > +               return;
>
> This WARN_ON() is bogus.
>
> Because you added the warning, you then need to make the callers check
> for the whole PTRACE_UNATTACH.
>
> So I think you should just remove the WARN_ON(), and then just call
> ptrace_unfreeze_traced() unconditionally after you've successfully
> done a ptrace_check_attach(). Just to keep the code simpler.

This is what initial patch did. But, assuming that ptrace_unfreeze_traced()
is called unconditionally, we need a locking or barriers, otherwise

	// another debugger attached after we did PTRACE_DETACH ?
	if (!task->ptrace || task->parent != current)
		return;

is racy. Suppose we trace the natural child, then do PTRACE_DETACH,
then another tracer comes. We can see ->ptrace and __TASK_TRACED,
but see the old task->parent == current.

Of course, this is only theoretical, and probably we can add a barrier
before this check, but I am not sure this will make the code simpler.
If nothing else, this needs a comment.

If PTRACE_DETACH doesn't do _unfreeze_, we know that the task is either
traced by us or it is exiting/exited, so we can always trust the
"state == __TASK_TRACED" check.

So I'd prefer to keep this code, but I won't insist if you still disagree.

> Also, in your corrected version, you had
>
> +               if (!wait_task_inactive(child, __TASK_TRACED)) {
> +                       /* This can only happen if may_ptrace_stop() fails */
> +                       WARN_ON(child->state == __TASK_TRACED);
> +                       ret = -ESRCH;
>
> where I actually think that the comment is not really helpful. It
> doesn't really explain what he child can do to get to ptrace_stop() in
> the first place, and what that does to the child state...

OK. Agreed. This comment reflects the fact that the first version removed
may_ptrace_stop() to ensure wait_task_inactive() can't fail.

I'll update the comment and resend.

Oleg.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-21 17:21                         ` Oleg Nesterov
@ 2013-01-21 18:27                           ` Linus Torvalds
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2013-01-21 18:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On Mon, Jan 21, 2013 at 9:21 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> This is what initial patch did. But, assuming that ptrace_unfreeze_traced()
> is called unconditionally, we need a locking or barriers, otherwise

Ahh. Fair enough.

I guess that works then. It's a bit sad, but at least I see why you did it.

                  Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 0/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-21 18:27                           ` Linus Torvalds
@ 2013-01-21 19:47                             ` Oleg Nesterov
  2013-01-21 19:47                               ` [PATCH v3 1/3] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
                                                 ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-21 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On 01/21, Linus Torvalds wrote:
>
> I guess that works then. It's a bit sad, but at least I see why you did it.

OK, please see v3 rebased on top of "unexport ptrace_check_attach()" you
already applied.

I tried to update the comment in ptrace_check_attach(), and changed unfreeze()
to simply do WARN_ON() without if/return.

There is also the cosmetic change in 1/3,

	-		ptrace_signal_wake_up(child, task_is_traced(child));
	+		ptrace_signal_wake_up(child, true);

in __ptrace_unlink().

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/3] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
@ 2013-01-21 19:47                               ` Oleg Nesterov
  2013-01-21 19:48                               ` [PATCH v3 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
                                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-21 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

Cleanup and preparation for the next change.

signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
necessary mask.

Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
which adds __TASK_TRACED.

This way ptrace_signal_wake_up() can work "inside" ptrace_request()
even if the tracee doesn't have the TASK_WAKEKILL bit set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched.h |   11 ++++++++++-
 kernel/ptrace.c       |    8 ++++----
 kernel/signal.c       |   14 ++++----------
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..48b4151 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2713,7 +2713,16 @@ static inline void thread_group_cputime_init(struct signal_struct *sig)
 extern void recalc_sigpending_and_wake(struct task_struct *t);
 extern void recalc_sigpending(void);
 
-extern void signal_wake_up(struct task_struct *t, int resume_stopped);
+extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
+
+static inline void signal_wake_up(struct task_struct *t, bool resume)
+{
+	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
+}
+static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
+{
+	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
+}
 
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 612a561..62f7c27 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -117,7 +117,7 @@ void __ptrace_unlink(struct task_struct *child)
 	 * TASK_KILLABLE sleeps.
 	 */
 	if (child->jobctl & JOBCTL_STOP_PENDING || task_is_traced(child))
-		signal_wake_up(child, task_is_traced(child));
+		ptrace_signal_wake_up(child, true);
 
 	spin_unlock(&child->sighand->siglock);
 }
@@ -317,7 +317,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 */
 	if (task_is_stopped(task) &&
 	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING))
-		signal_wake_up(task, 1);
+		signal_wake_up_state(task, __TASK_STOPPED);
 
 	spin_unlock(&task->sighand->siglock);
 
@@ -737,7 +737,7 @@ int ptrace_request(struct task_struct *child, long request,
 		 * tracee into STOP.
 		 */
 		if (likely(task_set_jobctl_pending(child, JOBCTL_TRAP_STOP)))
-			signal_wake_up(child, child->jobctl & JOBCTL_LISTENING);
+			ptrace_signal_wake_up(child, child->jobctl & JOBCTL_LISTENING);
 
 		unlock_task_sighand(child, &flags);
 		ret = 0;
@@ -763,7 +763,7 @@ int ptrace_request(struct task_struct *child, long request,
 			 * start of this trap and now.  Trigger re-trap.
 			 */
 			if (child->jobctl & JOBCTL_TRAP_NOTIFY)
-				signal_wake_up(child, true);
+				ptrace_signal_wake_up(child, true);
 			ret = 0;
 		}
 		unlock_task_sighand(child, &flags);
diff --git a/kernel/signal.c b/kernel/signal.c
index 372771e..2bf2731 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -680,23 +680,17 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  * No need to set need_resched since signal event passing
  * goes through ->blocked
  */
-void signal_wake_up(struct task_struct *t, int resume)
+void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
-	unsigned int mask;
-
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
-
 	/*
-	 * For SIGKILL, we want to wake it up in the stopped/traced/killable
+	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
 	 * case. We don't check t->state here because there is a race with it
 	 * executing another processor and just now entering stopped state.
 	 * By using wake_up_state, we ensure the process will wake up and
 	 * handle its death signal.
 	 */
-	mask = TASK_INTERRUPTIBLE;
-	if (resume)
-		mask |= TASK_WAKEKILL;
-	if (!wake_up_state(t, mask))
+	if (!wake_up_state(t, state | TASK_INTERRUPTIBLE))
 		kick_process(t);
 }
 
@@ -844,7 +838,7 @@ static void ptrace_trap_notify(struct task_struct *t)
 	assert_spin_locked(&t->sighand->siglock);
 
 	task_set_jobctl_pending(t, JOBCTL_TRAP_NOTIFY);
-	signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
+	ptrace_signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
 }
 
 /*
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
  2013-01-21 19:47                               ` [PATCH v3 1/3] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
@ 2013-01-21 19:48                               ` Oleg Nesterov
  2013-01-22 17:52                                 ` [PATCH v4 " Oleg Nesterov
  2013-01-21 19:48                               ` [PATCH v3 3/3] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
                                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-21 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

putreg() assumes that the tracee is not running and pt_regs_access()
can safely play with its stack. However a killed tracee can return
from ptrace_stop() to the low-level asm code and do RESTORE_REST,
this means that debugger can actually read/modify the kernel stack
until the tracee does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense
this race is even worse, the very fact the tracee can be woken up
breaks the logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it. Not only this fixes the mentioned problems,
we can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example
it makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/step.c |    9 +++---
 kernel/ptrace.c        |   64 ++++++++++++++++++++++++++++++++++++++++-------
 kernel/signal.c        |    5 +++
 3 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index cd3b243..9b4d51d 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -165,10 +165,11 @@ void set_task_blockstep(struct task_struct *task, bool on)
 	 * Ensure irq/preemption can't change debugctl in between.
 	 * Note also that both TIF_BLOCKSTEP and debugctl should
 	 * be changed atomically wrt preemption.
-	 * FIXME: this means that set/clear TIF_BLOCKSTEP is simply
-	 * wrong if task != current, SIGKILL can wakeup the stopped
-	 * tracee and set/clear can play with the running task, this
-	 * can confuse the next __switch_to_xtra().
+	 *
+	 * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if
+	 * task is current or it can't be running, otherwise we can race
+	 * with __switch_to_xtra(). We rely on ptrace_freeze_traced() but
+	 * PTRACE_KILL is not safe.
 	 */
 	local_irq_disable();
 	debugctl = get_debugctlmsr();
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 62f7c27..b6c22b5 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -122,6 +122,40 @@ void __ptrace_unlink(struct task_struct *child)
 	spin_unlock(&child->sighand->siglock);
 }
 
+/* Ensure that nothing can wake it up, even SIGKILL */
+static bool ptrace_freeze_traced(struct task_struct *task)
+{
+	bool ret = false;
+
+	/* Lockless, nobody but us can set this flag */
+	if (task->jobctl & JOBCTL_LISTENING)
+		return ret;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+		task->state = __TASK_TRACED;
+		ret = true;
+	}
+	spin_unlock_irq(&task->sighand->siglock);
+
+	return ret;
+}
+
+static void ptrace_unfreeze_traced(struct task_struct *task)
+{
+	if (task->state != __TASK_TRACED)
+		return;
+
+	WARN_ON(!task->ptrace || task->parent != current);
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (__fatal_signal_pending(task))
+		wake_up_state(task, __TASK_TRACED);
+	else
+		task->state = TASK_TRACED;
+	spin_unlock_irq(&task->sighand->siglock);
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -151,24 +185,29 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 * be changed by us so it's not changing right after this.
 	 */
 	read_lock(&tasklist_lock);
-	if ((child->ptrace & PT_PTRACED) && child->parent == current) {
+	if (child->ptrace && child->parent == current) {
+		WARN_ON(child->state == __TASK_TRACED);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		spin_lock_irq(&child->sighand->siglock);
-		WARN_ON_ONCE(task_is_stopped(child));
-		if (ignore_state || (task_is_traced(child) &&
-				     !(child->jobctl & JOBCTL_LISTENING)))
+		if (ignore_state || ptrace_freeze_traced(child))
 			ret = 0;
-		spin_unlock_irq(&child->sighand->siglock);
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state)
-		ret = wait_task_inactive(child, TASK_TRACED) ? 0 : -ESRCH;
+	if (!ret && !ignore_state) {
+		if (!wait_task_inactive(child, __TASK_TRACED)) {
+			/*
+			 * This can only happen if may_ptrace_stop() fails and
+			 * ptrace_stop() changes ->state back to TASK_RUNNING,
+			 * so we should not worry about leaking __TASK_TRACED.
+			 */
+			WARN_ON(child->state == __TASK_TRACED);
+			ret = -ESRCH;
+		}
+	}
 
-	/* All systems go.. */
 	return ret;
 }
 
@@ -900,6 +939,8 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
+	if (request != PTRACE_DETACH)
+		ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1039,8 +1080,11 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 
 	ret = ptrace_check_attach(child, request == PTRACE_KILL ||
 				  request == PTRACE_INTERRUPT);
-	if (!ret)
+	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
+		if (request != PTRACE_DETACH)
+			ptrace_unfreeze_traced(child);
+	}
 
  out_put_task_struct:
 	put_task_struct(child);
diff --git a/kernel/signal.c b/kernel/signal.c
index 2bf2731..980bcf3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1794,6 +1794,10 @@ static inline int may_ptrace_stop(void)
 	 * If SIGKILL was already sent before the caller unlocked
 	 * ->siglock we must see ->core_state != NULL. Otherwise it
 	 * is safe to enter schedule().
+	 *
+	 * This is almost outdated, a task with the pending SIGKILL can't
+	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
+	 * after SIGKILL was already dequeued.
 	 */
 	if (unlikely(current->mm->core_state) &&
 	    unlikely(current->mm == current->parent->mm))
@@ -1919,6 +1923,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		if (clear_code)
 			current->exit_code = 0;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 3/3] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
  2013-01-21 19:47                               ` [PATCH v3 1/3] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
  2013-01-21 19:48                               ` [PATCH v3 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
@ 2013-01-21 19:48                               ` Oleg Nesterov
  2013-01-22 17:51                               ` [PATCH v3 0/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
  2013-01-23 19:19                               ` TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL) Oleg Nesterov
  4 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-21 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
Change it to use TASK_NORMAL and add the WARN_ON().

TASK_ALL has no other users, probably can be killed.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/core.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..26058d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1523,7 +1523,8 @@ out:
  */
 int wake_up_process(struct task_struct *p)
 {
-	return try_to_wake_up(p, TASK_ALL, 0);
+	WARN_ON(task_is_stopped_or_traced(p));
+	return try_to_wake_up(p, TASK_NORMAL, 0);
 }
 EXPORT_SYMBOL(wake_up_process);
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 0/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
                                                 ` (2 preceding siblings ...)
  2013-01-21 19:48                               ` [PATCH v3 3/3] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
@ 2013-01-22 17:51                               ` Oleg Nesterov
  2013-01-23 19:19                               ` TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL) Oleg Nesterov
  4 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-22 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

On 01/21, Oleg Nesterov wrote:
>
> On 01/21, Linus Torvalds wrote:
> >
> > I guess that works then. It's a bit sad, but at least I see why you did it.
>
> OK, please see v3 rebased on top of "unexport ptrace_check_attach()" you
> already applied.
>
> I tried to update the comment in ptrace_check_attach(), and changed unfreeze()
> to simply do WARN_ON() without if/return.

Damn. But the current "if (request != PTRACE_DETACH)" is not right,
somehow I forgot that it can fail.

I am sending "[PATCH v4 2/3]" in reply to 2/3. Or see the fixlet below.

And perhaps you were right, ptrace_unfreeze_traced() should prevent the
attach-after-detach race itself... but iiuc then it needs the full mb(),
rmb() if not enough for transitivity.

Sorry for confusion.

Oleg.

------------------------------------------------------------------------------
[PATCH v3 4/3] ptrace: if PTRACE_DETACH fails we need ptrace_unfreeze_traced()

Somehow I forgot that PTRACE_DETACH can fail if !valid_signal(data).
Fix the check before ptrace_unfreeze_traced().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/ptrace.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index b6c22b5..6cbeaae 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -939,7 +939,7 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (request != PTRACE_DETACH)
+	if (ret || request != PTRACE_DETACH)
 		ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
@@ -1082,7 +1082,7 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (request != PTRACE_DETACH)
+		if (ret || request != PTRACE_DETACH)
 			ptrace_unfreeze_traced(child);
 	}
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL
  2013-01-21 19:48                               ` [PATCH v3 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
@ 2013-01-22 17:52                                 ` Oleg Nesterov
  0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-22 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	Linux Kernel Mailing List, Tejun Heo, Roland McGrath, Tony Luck,
	Fenghua Yu, Greg Kroah-Hartman

putreg() assumes that the tracee is not running and pt_regs_access()
can safely play with its stack. However a killed tracee can return
from ptrace_stop() to the low-level asm code and do RESTORE_REST,
this means that debugger can actually read/modify the kernel stack
until the tracee does SAVE_REST again.

set_task_blockstep() can race with SIGKILL too and in some sense
this race is even worse, the very fact the tracee can be woken up
breaks the logic.

As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
call, this ensures that nobody can ever wakeup the tracee while the
debugger looks at it. Not only this fixes the mentioned problems,
we can do some cleanups/simplifications in arch_ptrace() paths.

Probably ptrace_unfreeze_traced() needs more callers, for example
it makes sense to make the tracee killable for oom-killer before
access_process_vm().

While at it, add the comment into may_ptrace_stop() to explain why
ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

Reported-by: Salman Qazi <sqazi@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/step.c |    9 +++---
 kernel/ptrace.c        |   64 ++++++++++++++++++++++++++++++++++++++++-------
 kernel/signal.c        |    5 +++
 3 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index cd3b243..9b4d51d 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -165,10 +165,11 @@ void set_task_blockstep(struct task_struct *task, bool on)
 	 * Ensure irq/preemption can't change debugctl in between.
 	 * Note also that both TIF_BLOCKSTEP and debugctl should
 	 * be changed atomically wrt preemption.
-	 * FIXME: this means that set/clear TIF_BLOCKSTEP is simply
-	 * wrong if task != current, SIGKILL can wakeup the stopped
-	 * tracee and set/clear can play with the running task, this
-	 * can confuse the next __switch_to_xtra().
+	 *
+	 * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if
+	 * task is current or it can't be running, otherwise we can race
+	 * with __switch_to_xtra(). We rely on ptrace_freeze_traced() but
+	 * PTRACE_KILL is not safe.
 	 */
 	local_irq_disable();
 	debugctl = get_debugctlmsr();
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 62f7c27..6cbeaae 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -122,6 +122,40 @@ void __ptrace_unlink(struct task_struct *child)
 	spin_unlock(&child->sighand->siglock);
 }
 
+/* Ensure that nothing can wake it up, even SIGKILL */
+static bool ptrace_freeze_traced(struct task_struct *task)
+{
+	bool ret = false;
+
+	/* Lockless, nobody but us can set this flag */
+	if (task->jobctl & JOBCTL_LISTENING)
+		return ret;
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+		task->state = __TASK_TRACED;
+		ret = true;
+	}
+	spin_unlock_irq(&task->sighand->siglock);
+
+	return ret;
+}
+
+static void ptrace_unfreeze_traced(struct task_struct *task)
+{
+	if (task->state != __TASK_TRACED)
+		return;
+
+	WARN_ON(!task->ptrace || task->parent != current);
+
+	spin_lock_irq(&task->sighand->siglock);
+	if (__fatal_signal_pending(task))
+		wake_up_state(task, __TASK_TRACED);
+	else
+		task->state = TASK_TRACED;
+	spin_unlock_irq(&task->sighand->siglock);
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -151,24 +185,29 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 * be changed by us so it's not changing right after this.
 	 */
 	read_lock(&tasklist_lock);
-	if ((child->ptrace & PT_PTRACED) && child->parent == current) {
+	if (child->ptrace && child->parent == current) {
+		WARN_ON(child->state == __TASK_TRACED);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		spin_lock_irq(&child->sighand->siglock);
-		WARN_ON_ONCE(task_is_stopped(child));
-		if (ignore_state || (task_is_traced(child) &&
-				     !(child->jobctl & JOBCTL_LISTENING)))
+		if (ignore_state || ptrace_freeze_traced(child))
 			ret = 0;
-		spin_unlock_irq(&child->sighand->siglock);
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state)
-		ret = wait_task_inactive(child, TASK_TRACED) ? 0 : -ESRCH;
+	if (!ret && !ignore_state) {
+		if (!wait_task_inactive(child, __TASK_TRACED)) {
+			/*
+			 * This can only happen if may_ptrace_stop() fails and
+			 * ptrace_stop() changes ->state back to TASK_RUNNING,
+			 * so we should not worry about leaking __TASK_TRACED.
+			 */
+			WARN_ON(child->state == __TASK_TRACED);
+			ret = -ESRCH;
+		}
+	}
 
-	/* All systems go.. */
 	return ret;
 }
 
@@ -900,6 +939,8 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
+	if (ret || request != PTRACE_DETACH)
+		ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1039,8 +1080,11 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 
 	ret = ptrace_check_attach(child, request == PTRACE_KILL ||
 				  request == PTRACE_INTERRUPT);
-	if (!ret)
+	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
+		if (ret || request != PTRACE_DETACH)
+			ptrace_unfreeze_traced(child);
+	}
 
  out_put_task_struct:
 	put_task_struct(child);
diff --git a/kernel/signal.c b/kernel/signal.c
index 2bf2731..980bcf3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1794,6 +1794,10 @@ static inline int may_ptrace_stop(void)
 	 * If SIGKILL was already sent before the caller unlocked
 	 * ->siglock we must see ->core_state != NULL. Otherwise it
 	 * is safe to enter schedule().
+	 *
+	 * This is almost outdated, a task with the pending SIGKILL can't
+	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
+	 * after SIGKILL was already dequeued.
 	 */
 	if (unlikely(current->mm->core_state) &&
 	    unlikely(current->mm == current->parent->mm))
@@ -1919,6 +1923,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		if (clear_code)
 			current->exit_code = 0;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack)
  2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
                                       ` (4 preceding siblings ...)
  2013-01-20 19:35                     ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
@ 2013-01-23 18:00                     ` Greg Kroah-Hartman
  5 siblings, 0 replies; 21+ messages in thread
From: Greg Kroah-Hartman @ 2013-01-23 18:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dan Carpenter, Kernel Security, Michael Davidson,
	Suleiman Souhlal, Julien Tinnes, Aaron Durbin, Andrew Morton,
	linux-kernel, Tejun Heo, Roland McGrath, Tony Luck, Fenghua Yu

On Sun, Jan 20, 2013 at 08:24:48PM +0100, Oleg Nesterov wrote:
> Greg, this doesn't look like -stable material. But please let me know
> if you think 2/4 should be backported. With a couple of simple hacks
> in PTRACE_DETACH/LISTEN paths we can do this without 1/4 and without
> changes outside of ptrace.c. But again, probably we shouldn't do this.

I think it is -stable material, and I've now applied the three patches
that went into Linus's tree into the 3.4-stable and 3.7-stable trees.
There was only one minor patch problem with 3.4 (comment block not being
needed), if there is more work than that that you think is needed to get
these working in those kernel releases, please let me know.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL)
  2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
                                                 ` (3 preceding siblings ...)
  2013-01-22 17:51                               ` [PATCH v3 0/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
@ 2013-01-23 19:19                               ` Oleg Nesterov
  2013-01-23 19:50                                 ` Tejun Heo
  4 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-23 19:19 UTC (permalink / raw)
  To: Linus Torvalds, Yasunori Goto, Peter Zijlstra, Ingo Molnar
  Cc: Dan Carpenter, Michael Davidson, Suleiman Souhlal, Julien Tinnes,
	Aaron Durbin, Andrew Morton, Linux Kernel Mailing List,
	Tejun Heo, Roland McGrath, Alexander Gordeev

To avoid the confusion, this is not connected to ptrace_freeze_traced()
changes...

With or without these changes, there is another problem: a spurious
wakeup from try_to_wake_up(TASK_NORMAL) which doesn't necessarily see
the "right" task->state.

As for ptrace_stop() this is purely theoretical, but I thought that
perhaps it makes sense to extract the "mb + unlock_wait(pi_lock)" code
from do_exit() into the generic helper,
set_current_state_sync_because_we_cant_tolerate_a_wrong_wakeup().
But when I look at this code again I am not sure it is right.

Let me remind the problem. To oversimplify, we have

	try_to_wake_up(task, state)
	{
		lock(task->pi_lock);

		if (task->state & state)
			task->state = TASK_RUNNING;

		unlock(task->pi_lock);
	}

And this means that a task doing

	current->state = STATE_1;
	// no schedule() in between
	current->state = STATE_2;
	schedule();

can be actually woken up by try_to_wake_up(STATE_1) even if it already
sleeps in STATE_2.

Usually this is fine, any wait_event-like code should be careful. But
sometimes we can't afford the false wakeup, that is why do_wait() roughly
does

	do_exit()
	{
		// down_read(mmap_sem) can do this without schedule()
		current->state = TASK_UNINTERRUPTIBLE;
		current->state = TASK_RUNNING;

		mb();

		spin_unlock_wait(current->pi_lock);

		current->state = TASK_DEAD;
		schedule();
	}

This should ensure that any subsequent (after unlock_wait) try_to_wake_up()
can't see state == UNINTERRUPTIBLE and I think this works.

But. Somehow we missed the fact (I think) that we also need to serialize
unlock_wait() and "state = TASK_DEAD". The code above can be reordered,

	do_exit()
	{
		// down_read(mmap_sem) can do this without schedule()
		current->state = TASK_UNINTERRUPTIBLE;
		current->state = TASK_RUNNING;

		mb();

		current->state = TASK_DEAD;
		// !!! ttwu() can change ->state here !!!
		spin_unlock_wait(current->pi_lock);

		schedule();
	}

and we have the same problem again. So _I think_ that we we need another
mb() after unlock_wait() ?

And, afaics, in theory we can't simply move the current mb() down, after
unlock_wait().  (again, only in theory, if nothing else we should have
the implicit barrrers after we played with ->state in the past).

Or perhaps we should modify ttwu_do_wakeup() to not blindly set RUNNING,
say, cmpxchg(old_state, RUNNING). But this is not simple/nice.

Or I missed something?

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -869,8 +869,15 @@ void do_exit(long code)
 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
 	 * is held by try_to_wake_up()
 	 */
+
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
+	/*
+	 * The first mb() ensures that after that try_to_wake_up() must see
+	 * state == TASK_RUNNING. We need another one to ensure that we set
+	 * TASK_DEAD only after ->pi_lock is really unlocked.
+	 */
+	smp_mb();

 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL)
  2013-01-23 19:19                               ` TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL) Oleg Nesterov
@ 2013-01-23 19:50                                 ` Tejun Heo
  2013-01-24 18:50                                   ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2013-01-23 19:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Yasunori Goto, Peter Zijlstra, Ingo Molnar,
	Dan Carpenter, Michael Davidson, Suleiman Souhlal, Julien Tinnes,
	Aaron Durbin, Andrew Morton, Linux Kernel Mailing List,
	Roland McGrath, Alexander Gordeev

Hello, Oleg.

On Wed, Jan 23, 2013 at 08:19:46PM +0100, Oleg Nesterov wrote:
> Let me remind the problem. To oversimplify, we have
> 
> 	try_to_wake_up(task, state)
> 	{
> 		lock(task->pi_lock);
> 
> 		if (task->state & state)
> 			task->state = TASK_RUNNING;
> 
> 		unlock(task->pi_lock);
> 	}
> 
> And this means that a task doing
> 
> 	current->state = STATE_1;
> 	// no schedule() in between
> 	current->state = STATE_2;
> 	schedule();
> 
> can be actually woken up by try_to_wake_up(STATE_1) even if it already
> sleeps in STATE_2.

Hmmm... nasty.

...
> and we have the same problem again. So _I think_ that we we need another
> mb() after unlock_wait() ?

Seems so, or, maybe we should add barrier semantics to unlock_wait()?
As it currently stands, it kinda invites misusages.

> And, afaics, in theory we can't simply move the current mb() down, after
> unlock_wait().  (again, only in theory, if nothing else we should have
> the implicit barrrers after we played with ->state in the past).
> 
> Or perhaps we should modify ttwu_do_wakeup() to not blindly set RUNNING,
> say, cmpxchg(old_state, RUNNING). But this is not simple/nice.

I personally think this is the right thing to do short of requiring
locking on current->state changes.  The situation is a bit muddy
because we're generally requiring sleepers to loop while still having
cases where things don't work that way.  It's a little scary that we
require looping to protect against stray wakeups, which can be very
rare, without any way to verify/test.

The waker would be acquiring the cacheline exclusively one way or the
other, so I don't think doing cmpxchg would add much overhead.  We
would definitely want to do comparisons tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL)
  2013-01-23 19:50                                 ` Tejun Heo
@ 2013-01-24 18:50                                   ` Oleg Nesterov
  0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-01-24 18:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Yasunori Goto, Peter Zijlstra, Ingo Molnar,
	Dan Carpenter, Michael Davidson, Suleiman Souhlal, Julien Tinnes,
	Aaron Durbin, Andrew Morton, Linux Kernel Mailing List,
	Roland McGrath, Alexander Gordeev

Hi Tejun,

On 01/23, Tejun Heo wrote:
> Hello, Oleg.
>
> On Wed, Jan 23, 2013 at 08:19:46PM +0100, Oleg Nesterov wrote:
> >
> > And this means that a task doing
> >
> > 	current->state = STATE_1;
> > 	// no schedule() in between
> > 	current->state = STATE_2;
> > 	schedule();
> >
> > can be actually woken up by try_to_wake_up(STATE_1) even if it already
> > sleeps in STATE_2.
>
> Hmmm... nasty.
>
> ...
> > and we have the same problem again. So _I think_ that we we need another
> > mb() after unlock_wait() ?
>
> Seems so, or, maybe we should add barrier semantics to unlock_wait()?
> As it currently stands, it kinda invites misusages.

Perhaps, I am not sure. I agree, unlock_wait() without a barrier at
least on one side probably never makes sense.

> > And, afaics, in theory we can't simply move the current mb() down, after
> > unlock_wait().  (again, only in theory, if nothing else we should have
> > the implicit barrrers after we played with ->state in the past).
> >
> > Or perhaps we should modify ttwu_do_wakeup() to not blindly set RUNNING,
> > say, cmpxchg(old_state, RUNNING). But this is not simple/nice.
>
> I personally think this is the right thing to do short of requiring
> locking on current->state changes.  The situation is a bit muddy
> because we're generally requiring sleepers to loop while still having
> cases where things don't work that way.  It's a little scary that we
> require looping to protect against stray wakeups, which can be very
> rare, without any way to verify/test.
>
> The waker would be acquiring the cacheline exclusively one way or the
> other, so I don't think doing cmpxchg would add much overhead.  We
> would definitely want to do comparisons tho.

Still cmpxchg() is not free, Peter argued that ttwu() is the very hot
path and I understand his concern. And in fact this change doesn't look
very simple...

And this can not eliminate the spurious wakeups in general. Plus in
this particular case (I mean the race described by the comment above
spin_unlock_wait in do_exit) we could equally blame rw_semaphore, not
try_to_wake_up().

So I leave this to scheduler people.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-01-24 18:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130115172613.GA3011@redhat.com>
     [not found] ` <20130116181830.GA6469@redhat.com>
     [not found]   ` <CA+55aFzkWqSEzw9oa5JodrM2NWE0H_AF7xyzRhd+DQ=PB=ZT2A@mail.gmail.com>
     [not found]     ` <20130118153700.GA27915@redhat.com>
     [not found]       ` <CA+55aFxEow_-PoX0xFa07yOi6az=6uVx8zeOsfToErmzh7dB8A@mail.gmail.com>
     [not found]         ` <20130118172854.GA29753@redhat.com>
     [not found]           ` <20130118175224.GA520@redhat.com>
     [not found]             ` <CA+55aFyEsU-pkX557A-m+xoGkA_v+fXEyA8z8HbJ5J8K1jObeg@mail.gmail.com>
     [not found]               ` <20130118185559.GA3773@redhat.com>
     [not found]                 ` <CA+55aFy=newnMbx53HipyWbRs2mUUPSqXXCpSfDLW78gkro37g@mail.gmail.com>
2013-01-20 19:24                   ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
2013-01-20 19:25                     ` [PATCH 1/4] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
2013-01-20 19:25                     ` [PATCH 2/4] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
2013-01-20 19:46                       ` [PATCH v2 " Oleg Nesterov
2013-01-20 20:21                       ` [PATCH " Linus Torvalds
2013-01-21 17:21                         ` Oleg Nesterov
2013-01-21 18:27                           ` Linus Torvalds
2013-01-21 19:47                             ` [PATCH v3 0/3] " Oleg Nesterov
2013-01-21 19:47                               ` [PATCH v3 1/3] ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() Oleg Nesterov
2013-01-21 19:48                               ` [PATCH v3 2/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
2013-01-22 17:52                                 ` [PATCH v4 " Oleg Nesterov
2013-01-21 19:48                               ` [PATCH v3 3/3] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
2013-01-22 17:51                               ` [PATCH v3 0/3] ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL Oleg Nesterov
2013-01-23 19:19                               ` TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request can never race with SIGKILL) Oleg Nesterov
2013-01-23 19:50                                 ` Tejun Heo
2013-01-24 18:50                                   ` Oleg Nesterov
2013-01-20 19:25                     ` [PATCH 3/4] ia64: kill thread_matches(), unexport ptrace_check_attach() Oleg Nesterov
2013-01-20 20:23                       ` Linus Torvalds
2013-01-20 19:26                     ` [PATCH 4/4] wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task Oleg Nesterov
2013-01-20 19:35                     ` [PATCH 0/4] (Was: ptrace: prevent PTRACE_SETREGS from corrupting stack) Oleg Nesterov
2013-01-23 18:00                     ` Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).