* [PATCH 0/2] fix the traced mt-exec deadlock
@ 2017-02-13 14:14 Oleg Nesterov
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

Hello,

Let's finally fix this problem; it has been reported several times. I still think
that in the longer term we should (try to) rework the security hooks and
(partially) revert this change, but that is not trivial and we need something
backportable anyway.

Eric, Jann, we already discussed this change. 1/2 is the same patch I suggested
three months ago, except that now it compiles and moves flush_signal_handlers()
into de_thread().

Both patches ask for subsequent cleanups; see the changelogs.

Oleg.

 arch/x86/ia32/ia32_aout.c |   3 ++
 fs/binfmt_aout.c          |   3 ++
 fs/binfmt_elf.c           |   6 ++-
 fs/binfmt_elf_fdpic.c     |   4 ++
 fs/binfmt_flat.c          |   3 ++
 fs/exec.c                 | 128 +++++++++++++++++++++++-----------------------
 include/linux/binfmts.h   |   1 +
 kernel/exit.c             |   5 +-
 kernel/signal.c           |  21 +++++---
 9 files changed, 101 insertions(+), 73 deletions(-)

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
@ 2017-02-13 14:15 ` Oleg Nesterov
  2017-02-13 16:12   ` kbuild test robot
                     ` (3 more replies)
  2017-02-13 14:15 ` [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
  2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
  2 siblings, 4 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

de_thread() waits for the other threads with ->cred_guard_mutex held, and this
is really bad because the wait time is not bounded: a debugger can delay the
exit, and this lock has a lot of users (mostly abusers, imo) in fs/proc and
elsewhere. This leads to a deadlock if the debugger tries to take the same
mutex:

	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <pthread.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

This hangs because de_thread() waits, with cred_guard_mutex held, for the
debugger to release the killed thread, while the debugger sleeps waiting for
the same mutex. Not really that bad in itself, since the tracer can be killed;
but it is still a bug and people hit it in practice.

The patch changes flush_old_exec() to wait only until all the threads have
passed exit_notify(), using the new kill_sub_threads() helper instead of
de_thread().

However, we still need to wait until they all disappear. The main reason is
that we cannot unshare sighand until then: the execing thread and the zombies
must use the same sighand->siglock to serialize access to ->thread_head and
friends. So the patch changes the ->load_binary() methods to call de_thread()
right after install_exec_creds() drops cred_guard_mutex.

Note that de_thread() is simplified: it no longer needs to send SIGKILL to the
sub-threads and wait for the group leader to become a zombie. But since it is
now called after flush_old_exec(), we also need to move flush_signal_handlers()
into de_thread().
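The counting handshake described above — kill_sub_threads() seeds
->notify_count with minus the number of live sub-threads, exit_notify()
increments it, and the thread that brings it to zero wakes the execing
thread — can be sketched as a userspace analogue with pthreads (illustrative
only; struct notify, sub_thread_exit() and demo_notify_count() are made-up
names, not kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct notify {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int count;		/* negative: sub-threads still to exit */
};

/* Analogue of the exit_notify() side: each exiting thread increments
 * the counter; the one that brings it to zero wakes the waiter. */
static void *sub_thread_exit(void *arg)
{
	struct notify *n = arg;

	pthread_mutex_lock(&n->lock);
	if (!++n->count)
		pthread_cond_signal(&n->cond);
	pthread_mutex_unlock(&n->lock);
	return NULL;
}

/* Analogue of kill_sub_threads() + wait_for_notify_count(): seed the
 * counter with minus the number of live sub-threads, then sleep until
 * it reaches zero.  Returns the final counter value (0 on success). */
static int demo_notify_count(int nthreads)
{
	struct notify n = {
		PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, -nthreads
	};
	pthread_t t[16];
	int i, ret;

	if (nthreads < 1 || nthreads > 16)
		return -1;

	for (i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, sub_thread_exit, &n);

	pthread_mutex_lock(&n.lock);
	while (n.count)
		pthread_cond_wait(&n.cond, &n.lock);
	ret = n.count;
	pthread_mutex_unlock(&n.lock);

	for (i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return ret;
}
```

The kernel version obviously cannot use a condvar; it uses TASK_KILLABLE plus
wake_up_process() under tasklist_lock/siglock instead, but the counter
discipline is the same.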

TODO:

- Move install_exec_creds() and de_thread() into setup_new_exec() and unexport
  de_thread().

- Avoid tasklist_lock in kill_sub_threads(); we can change exit_notify() to
  set ->exit_state under siglock.

- Simplify/clean up the ->notify_count logic; it is really ugly. I think we
  should just turn this counter into signal->nr_live_threads; __exit_signal()
  can use signal->nr_threads instead.

- In any case we should limit the scope of cred_guard_mutex in the execve
  paths. It is not clear why we take it at the start of execve and, worse, it
  is not clear why we actually overload this mutex to exclude other threads
  (except for check_unsafe_exec(), but that is solvable). The original
  motivation was signal->in_exec_mm protection, but this idea was replaced by
  3c77f8457221 ("exec: make argv/envp memory visible to oom-killer"). It is
  just ugly to call copy_strings() and friends with this mutex held.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Reported-by: Attila Fazekas <afazekas@redhat.com>
Reported-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/ia32/ia32_aout.c |   3 ++
 fs/binfmt_aout.c          |   3 ++
 fs/binfmt_elf.c           |   6 ++-
 fs/binfmt_elf_fdpic.c     |   4 ++
 fs/binfmt_flat.c          |   3 ++
 fs/exec.c                 | 128 +++++++++++++++++++++++-----------------------
 include/linux/binfmts.h   |   1 +
 kernel/exit.c             |   5 +-
 kernel/signal.c           |   3 +-
 9 files changed, 87 insertions(+), 69 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 7c0a711..a6b9cc9 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -312,6 +312,9 @@ static int load_aout_binary(struct linux_binprm *bprm)
 		return retval;
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		return retval;
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 2a59139..edd1335 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -256,6 +256,9 @@ static int load_aout_binary(struct linux_binprm * bprm)
 		return retval;
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		return retval;
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 4223702..79508f7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	setup_new_exec(bprm);
 	install_exec_creds(bprm);
 
+	retval = de_thread(current);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
 				 executable_stack);
 	if (retval < 0)
 		goto out_free_dentry;
-	
+
 	current->mm->start_stack = bprm->p;
 
 	/* Now we do a little grungy work by mmapping the ELF image into
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index d2e36f8..75fd6d8 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 #endif
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		goto error;
+
 	if (create_elf_fdpic_tables(bprm, current->mm,
 				    &exec_params, &interp_params) < 0)
 		goto error;
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 9b2917a..a0ad9a3 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm)
 	}
 
 	install_exec_creds(bprm);
+	res = de_thread(current);
+	if (res)
+		return res;
 
 	set_binfmt(&flat_format);
 
diff --git a/fs/exec.c b/fs/exec.c
index e579466..8591c56 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
+{
+	for (;;) {
+		if (unlikely(__fatal_signal_pending(tsk)))
+			goto killed;
+		set_current_state(TASK_KILLABLE);
+		if (!sig->notify_count)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EINTR;
+}
+
+/*
+ * Kill all the sub-threads and wait until they all pass exit_notify().
+ */
+static int kill_sub_threads(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	int err = -EINTR;
+
+	if (thread_group_empty(tsk))
+		return 0;
+
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (!signal_group_exit(sig)) {
+		sig->group_exit_task = tsk;
+		sig->notify_count = -zap_other_threads(tsk);
+		err = 0;
+	}
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+
+	if (!err)
+		err = wait_for_notify_count(tsk, sig);
+	return err;
+
+}
+
 /*
- * This function makes sure the current process has its own signal table,
- * so that flush_signal_handlers can later reset the handlers without
- * disturbing other processes.  (Other processes might share the signal
- * table via the CLONE_SIGHAND option to clone().)
+ * This function makes sure the current process has no other threads and
+ * has a private signal table so that flush_signal_handlers() can reset
+ * the handlers without disturbing other processes which might share the
+ * signal table via the CLONE_SIGHAND option to clone().
  */
-static int de_thread(struct task_struct *tsk)
+int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
 
-	/*
-	 * Kill all other threads in the thread group.
-	 */
 	spin_lock_irq(lock);
-	if (signal_group_exit(sig)) {
-		/*
-		 * Another group action in progress, just
-		 * return so that the signal is processed.
-		 */
-		spin_unlock_irq(lock);
-		return -EAGAIN;
-	}
-
-	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
+	sig->notify_count = sig->nr_threads;
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
-
-	while (sig->notify_count) {
-		__set_current_state(TASK_KILLABLE);
-		spin_unlock_irq(lock);
-		schedule();
-		if (unlikely(__fatal_signal_pending(tsk)))
-			goto killed;
-		spin_lock_irq(lock);
-	}
 	spin_unlock_irq(lock);
 
+	if (wait_for_notify_count(tsk, sig))
+		return -EINTR;
+
 	/*
 	 * At this point all other threads have exited, all we have to
-	 * do is to wait for the thread group leader to become inactive,
-	 * and to assume its PID:
+	 * do is to reap the old leader and assume its PID.
 	 */
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
-
+		threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 		/*
 		 * The only record we have of the real-time age of a
 		 * process, regardless of execs it's done, is start_time.
@@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+no_thread_group:
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
-
-no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
@@ -1198,15 +1210,8 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	BUG_ON(!thread_group_leader(tsk));
+	flush_signal_handlers(current, 0);
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *get_task_comm(char *buf, struct task_struct *tsk)
@@ -1237,11 +1242,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 {
 	int retval;
 
-	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
-	 */
-	retval = de_thread(current);
+	retval = kill_sub_threads(current);
 	if (retval)
 		goto out;
 
@@ -1336,7 +1337,6 @@ void setup_new_exec(struct linux_binprm * bprm)
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
 	current->self_exec_id++;
-	flush_signal_handlers(current, 0);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1303b57..06a5a7b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
+extern int de_thread(struct task_struct *tsk);
 extern void would_dump(struct linux_binprm *, struct file *);
 
 extern int suid_dumpable;
diff --git a/kernel/exit.c b/kernel/exit.c
index 8f14b86..169d9f2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
+	/* mt-exec, kill_sub_threads() is waiting for group exit */
+	if (unlikely(tsk->signal->notify_count < 0) &&
+	    !++tsk->signal->notify_count)
 		wake_up_process(tsk->signal->group_exit_task);
 	write_unlock_irq(&tasklist_lock);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 3603d93..b78ce63 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
-
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;
-- 
2.5.0


* [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec
  2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2017-02-13 14:15 ` Oleg Nesterov
  2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
  2 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

The previous patch fixed the deadlock when mt-exec waits for a debugger which
should reap a zombie thread, but we can hit the same problem if the killed
sub-thread stops in ptrace_event(PTRACE_EVENT_EXIT). Change may_ptrace_stop()
to check signal->group_exit_task.

This is a user-visible change, but hopefully it can't break anything. Note
that the semantics of PTRACE_EVENT_EXIT were never really defined; they depend
on /dev/random. Just for example, currently a sub-thread killed by exec will
stop, but if it exits on its own and races with exec it will not stop, so
nobody can rely on PTRACE_EVENT_EXIT anyway.

We really need to finally define what PTRACE_EVENT_EXIT should actually
do, but this needs other changes.
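For background, a tracer observes PTRACE_EVENT_EXIT roughly as below (a
minimal sketch with error handling omitted; trace_exit_event() is a made-up
helper, and it assumes ptrace() is permitted in the environment):

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with status 42, catch its PTRACE_EVENT_EXIT
 * stop, and return the exit code recovered via PTRACE_GETEVENTMSG. */
static int trace_exit_event(void)
{
	int status;
	unsigned long msg = 0;
	pid_t pid = fork();

	if (pid == 0) {
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		raise(SIGSTOP);			/* let the tracer set options */
		_exit(42);
	}

	waitpid(pid, &status, 0);		/* the initial SIGSTOP stop */
	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT);
	ptrace(PTRACE_CONT, pid, 0, 0);

	waitpid(pid, &status, 0);		/* the PTRACE_EVENT_EXIT stop */
	if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXIT << 8)))
		ptrace(PTRACE_GETEVENTMSG, pid, 0, &msg);

	ptrace(PTRACE_CONT, pid, 0, 0);		/* let the tracee really die */
	waitpid(pid, &status, 0);

	return (int)(msg >> 8);			/* wait-style status -> exit code */
}
```

At the event stop the tracee is still alive but past the point of no return;
the patch below makes a tracee killed by a group exit or exec skip this stop
rather than block with the execing thread holding cred_guard_mutex.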

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/signal.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index b78ce63..16e44d7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1738,6 +1738,20 @@ static inline int may_ptrace_stop(void)
 {
 	if (!likely(current->ptrace))
 		return 0;
+
+	/*
+	 * Other checks can only affect PTRACE_EVENT_EXIT, a task with the
+	 * pending SIGKILL can't block in TASK_TRACED. But PTRACE_EVENT_EXIT
+	 * can be reported after SIGKILL was already dequeued.
+	 */
+
+	/*
+	 * Coredump or exec. We must not stop in the latter case, the debugger
+	 * can wait for cred_guard_mutex held by the execing thread.
+	 */
+	if (current->signal->group_exit_task)
+		return 0;
+
 	/*
 	 * Are we in the middle of do_coredump?
 	 * If so and our tracer is also part of the coredump stopping
@@ -1746,10 +1760,6 @@ static inline int may_ptrace_stop(void)
 	 * If SIGKILL was already sent before the caller unlocked
 	 * ->siglock we must see ->core_state != NULL. Otherwise it
 	 * is safe to enter schedule().
-	 *
-	 * This is almost outdated, a task with the pending SIGKILL can't
-	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
-	 * after SIGKILL was already dequeued.
 	 */
 	if (unlikely(current->mm->core_state) &&
 	    unlikely(current->mm == current->parent->mm))
-- 
2.5.0


* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2017-02-13 16:12   ` kbuild test robot
  2017-02-13 16:47     ` Oleg Nesterov
  2017-02-13 16:39   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: kbuild test robot @ 2017-02-13 16:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook,
	Michal Hocko, Ulrich Obergfell, linux-kernel


Hi Oleg,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc8 next-20170213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Oleg-Nesterov/fix-the-traced-mt-exec-deadlock/20170213-221943
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

>> ERROR: "de_thread" [fs/binfmt_aout.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
  2017-02-13 16:12   ` kbuild test robot
@ 2017-02-13 16:39   ` kbuild test robot
  2017-02-13 17:27   ` Mika Penttilä
  2017-02-13 18:04   ` [PATCH V2 " Oleg Nesterov
  3 siblings, 0 replies; 93+ messages in thread
From: kbuild test robot @ 2017-02-13 16:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook,
	Michal Hocko, Ulrich Obergfell, linux-kernel


Hi Oleg,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc8 next-20170213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Oleg-Nesterov/fix-the-traced-mt-exec-deadlock/20170213-221943
config: x86_64-randconfig-ne0-02132256 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> ERROR: "de_thread" [arch/x86/ia32/ia32_aout.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 16:12   ` kbuild test robot
@ 2017-02-13 16:47     ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 16:47 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook,
	Michal Hocko, Ulrich Obergfell, linux-kernel

On 02/14, kbuild test robot wrote:
>
> Hi Oleg,
>
> [...]
>
> All errors (new ones prefixed by >>):
> 
> >> ERROR: "de_thread" [fs/binfmt_aout.ko] undefined!

Aaah, thanks...

Of course, I forgot to add EXPORT_SYMBOL(de_thread); will send V2.

Thanks a lot!

Oleg.


* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
  2017-02-13 16:12   ` kbuild test robot
  2017-02-13 16:39   ` kbuild test robot
@ 2017-02-13 17:27   ` Mika Penttilä
  2017-02-13 18:01     ` Oleg Nesterov
  2017-02-13 18:04   ` [PATCH V2 " Oleg Nesterov
  3 siblings, 1 reply; 93+ messages in thread
From: Mika Penttilä @ 2017-02-13 17:27 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel


On 13.02.2017 16:15, Oleg Nesterov wrote:
> [...]
>
>  	spin_lock_irq(lock);
> -	if (signal_group_exit(sig)) {
> -		/*
> -		 * Another group action in progress, just
> -		 * return so that the signal is processed.
> -		 */
> -		spin_unlock_irq(lock);
> -		return -EAGAIN;
> -	}
> -
> -	sig->group_exit_task = tsk;
> -	sig->notify_count = zap_other_threads(tsk);
> +	sig->notify_count = sig->nr_threads;


maybe nr_threads - 1 since nr_threads includes us ?

+	sig->notify_count = sig->nr_threads - 1;



>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
> -
> -	while (sig->notify_count) {
> -		__set_current_state(TASK_KILLABLE);
> -		spin_unlock_irq(lock);
> -		schedule();
> -		if (unlikely(__fatal_signal_pending(tsk)))
> -			goto killed;
> -		spin_lock_irq(lock);
> -	}
>  	spin_unlock_irq(lock);
>  
> +	if (wait_for_notify_count(tsk, sig))
> +		return -EINTR;
> +
>  	/*
>  	 * At this point all other threads have exited, all we have to
> -	 * do is to wait for the thread group leader to become inactive,
> -	 * and to assume its PID:
> +	 * do is to reap the old leader and assume its PID.
>  	 */
>  	if (!thread_group_leader(tsk)) {
>  		struct task_struct *leader = tsk->group_leader;
>  
> -		for (;;) {
> -			threadgroup_change_begin(tsk);
> -			write_lock_irq(&tasklist_lock);
> -			/*
> -			 * Do this under tasklist_lock to ensure that
> -			 * exit_notify() can't miss ->group_exit_task
> -			 */
> -			sig->notify_count = -1;
> -			if (likely(leader->exit_state))
> -				break;
> -			__set_current_state(TASK_KILLABLE);
> -			write_unlock_irq(&tasklist_lock);
> -			threadgroup_change_end(tsk);
> -			schedule();
> -			if (unlikely(__fatal_signal_pending(tsk)))
> -				goto killed;
> -		}
> -
> +		threadgroup_change_begin(tsk);
> +		write_lock_irq(&tasklist_lock);
>  		/*
>  		 * The only record we have of the real-time age of a
>  		 * process, regardless of execs it's done, is start_time.
> @@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +no_thread_group:
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
> -
> -no_thread_group:
>  	/* we have changed execution domain */
>  	tsk->exit_signal = SIGCHLD;
>  
> @@ -1198,15 +1210,8 @@ static int de_thread(struct task_struct *tsk)
>  	}
>  
>  	BUG_ON(!thread_group_leader(tsk));
> +	flush_signal_handlers(current, 0);
>  	return 0;
> -
> -killed:
> -	/* protects against exit_notify() and __exit_signal() */
> -	read_lock(&tasklist_lock);
> -	sig->group_exit_task = NULL;
> -	sig->notify_count = 0;
> -	read_unlock(&tasklist_lock);
> -	return -EAGAIN;
>  }
>  
>  char *get_task_comm(char *buf, struct task_struct *tsk)
> @@ -1237,11 +1242,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  {
>  	int retval;
>  
> -	/*
> -	 * Make sure we have a private signal table and that
> -	 * we are unassociated from the previous thread group.
> -	 */
> -	retval = de_thread(current);
> +	retval = kill_sub_threads(current);
>  	if (retval)
>  		goto out;
>  
> @@ -1336,7 +1337,6 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	/* An exec changes our domain. We are no longer part of the thread
>  	   group */
>  	current->self_exec_id++;
> -	flush_signal_handlers(current, 0);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
>  
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 1303b57..06a5a7b 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
>  extern int search_binary_handler(struct linux_binprm *);
>  extern int flush_old_exec(struct linux_binprm * bprm);
>  extern void setup_new_exec(struct linux_binprm * bprm);
> +extern int de_thread(struct task_struct *tsk);
>  extern void would_dump(struct linux_binprm *, struct file *);
>  
>  extern int suid_dumpable;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 8f14b86..169d9f2 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	if (tsk->exit_state == EXIT_DEAD)
>  		list_add(&tsk->ptrace_entry, &dead);
>  
> -	/* mt-exec, de_thread() is waiting for group leader */
> -	if (unlikely(tsk->signal->notify_count < 0))
> +	/* mt-exec, kill_sub_threads() is waiting for group exit */
> +	if (unlikely(tsk->signal->notify_count < 0) &&
> +	    !++tsk->signal->notify_count)
>  		wake_up_process(tsk->signal->group_exit_task);
>  	write_unlock_irq(&tasklist_lock);
>  
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 3603d93..b78ce63 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p)
>  
>  	while_each_thread(p, t) {
>  		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> -		count++;
> -
>  		/* Don't bother with already dead threads */
>  		if (t->exit_state)
>  			continue;
>  		sigaddset(&t->pending.signal, SIGKILL);
>  		signal_wake_up(t, 1);
> +		count++;
>  	}
>  
>  	return count;

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 17:27   ` Mika Penttilä
@ 2017-02-13 18:01     ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 18:01 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko,
	Ulrich Obergfell, linux-kernel

On 02/13, Mika Penttilä wrote:
>
> > +int de_thread(struct task_struct *tsk)
> >  {
> >  	struct signal_struct *sig = tsk->signal;
> >  	struct sighand_struct *oldsighand = tsk->sighand;
> > @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
> >  	if (thread_group_empty(tsk))
> >  		goto no_thread_group;
> >
> > -	/*
> > -	 * Kill all other threads in the thread group.
> > -	 */
> >  	spin_lock_irq(lock);
> > -	if (signal_group_exit(sig)) {
> > -		/*
> > -		 * Another group action in progress, just
> > -		 * return so that the signal is processed.
> > -		 */
> > -		spin_unlock_irq(lock);
> > -		return -EAGAIN;
> > -	}
> > -
> > -	sig->group_exit_task = tsk;
> > -	sig->notify_count = zap_other_threads(tsk);
> > +	sig->notify_count = sig->nr_threads;
>
>
> maybe nr_threads - 1 since nr_threads includes us ?

Damn. Of course you are right, thanks a lot! Please see v2.

Hmm. I didn't even notice my own test-case didn't pass because of this
off-by-one.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
                     ` (2 preceding siblings ...)
  2017-02-13 17:27   ` Mika Penttilä
@ 2017-02-13 18:04   ` Oleg Nesterov
  2017-02-16 11:42     ` Eric W. Biederman
  2017-02-17  4:42     ` Eric W. Biederman
  3 siblings, 2 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-13 18:04 UTC (permalink / raw)
  To: Andrew Morton, Mika Penttilä
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

de_thread() waits for the other threads with ->cred_guard_mutex held, and this
is really bad because the wait time is not bounded: a debugger can delay the exit,
and this lock has a lot of users (mostly abusers, imo) in fs/proc and elsewhere.
This leads to a deadlock if the debugger tries to take the same mutex:

	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <pthread.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

This hangs because de_thread(), holding cred_guard_mutex, waits for the debugger
to release the killed thread, while the debugger sleeps waiting for the same
mutex. Not really that bad, since the tracer can be killed, but still this is
a bug, and people hit it in practice.

The patch changes flush_old_exec() to wait until all the threads have passed
exit_notify(), using the new kill_sub_threads() helper instead of de_thread().

However, we still need to wait until they all disappear. The main reason is
that we can not unshare ->sighand until then: the execing thread and the zombies
must use the same sighand->siglock to serialize access to ->thread_head etc.
So the patch changes the ->load_binary() methods to call de_thread() right after
install_exec_creds() drops cred_guard_mutex.

Note that de_thread() is simplified: it no longer needs to send SIGKILL to the
sub-threads or wait for the group leader to become a zombie. But since it is
now called after flush_old_exec(), we also need to move flush_signal_handlers()
into de_thread().

TODO:

- Move install_exec_creds() and de_thread() into setup_new_exec(), unexport
  de_thread().

- Avoid tasklist_lock in kill_sub_threads(); we can change exit_notify() to
  set ->exit_state under siglock.

- Simplify/cleanup the ->notify_count logic; it is really ugly. I think we
  should just turn this counter into signal->nr_live_threads. __exit_signal
  can use signal->nr_threads instead.

- In any case we should limit the scope of cred_guard_mutex in the execve paths.
  It is not clear why we take it at the start of execve, and worse, it is
  not clear why we actually overload this mutex to exclude other threads
  (except for check_unsafe_exec(), but this is solvable). The original motivation
  was signal->in_exec_mm protection, but this idea was replaced by 3c77f8457221
  ("exec: make argv/envp memory visible to oom-killer"). It is just ugly to
  call copy_strings() etc with this mutex held.

v2:
- add EXPORT_SYMBOL(de_thread)
- fix the usage of nr_threads in de_thread(), pointed by
  Mika Penttila <mika.penttila@nextfour.com>

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Reported-by: Attila Fazekas <afazekas@redhat.com>
Reported-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/ia32/ia32_aout.c |   3 ++
 fs/binfmt_aout.c          |   3 ++
 fs/binfmt_elf.c           |   6 ++-
 fs/binfmt_elf_fdpic.c     |   4 ++
 fs/binfmt_flat.c          |   3 ++
 fs/exec.c                 | 129 +++++++++++++++++++++++-----------------------
 include/linux/binfmts.h   |   1 +
 kernel/exit.c             |   5 +-
 kernel/signal.c           |   3 +-
 9 files changed, 88 insertions(+), 69 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 7c0a711..a6b9cc9 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -312,6 +312,9 @@ static int load_aout_binary(struct linux_binprm *bprm)
 		return retval;
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		return retval;
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 2a59139..edd1335 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -256,6 +256,9 @@ static int load_aout_binary(struct linux_binprm * bprm)
 		return retval;
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		return retval;
 
 	if (N_MAGIC(ex) == OMAGIC) {
 		unsigned long text_addr, map_size;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 4223702..79508f7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	setup_new_exec(bprm);
 	install_exec_creds(bprm);
 
+	retval = de_thread(current);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
 				 executable_stack);
 	if (retval < 0)
 		goto out_free_dentry;
-	
+
 	current->mm->start_stack = bprm->p;
 
 	/* Now we do a little grungy work by mmapping the ELF image into
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index d2e36f8..75fd6d8 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 #endif
 
 	install_exec_creds(bprm);
+	retval = de_thread(current);
+	if (retval)
+		goto error;
+
 	if (create_elf_fdpic_tables(bprm, current->mm,
 				    &exec_params, &interp_params) < 0)
 		goto error;
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 9b2917a..a0ad9a3 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm)
 	}
 
 	install_exec_creds(bprm);
+	res = de_thread(current);
+	if (res)
+		return res;
 
 	set_binfmt(&flat_format);
 
diff --git a/fs/exec.c b/fs/exec.c
index e579466..8e8f2ef 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
+{
+	for (;;) {
+		if (unlikely(__fatal_signal_pending(tsk)))
+			goto killed;
+		set_current_state(TASK_KILLABLE);
+		if (!sig->notify_count)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EINTR;
+}
+
+/*
+ * Kill all the sub-threads and wait until they all pass exit_notify().
+ */
+static int kill_sub_threads(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	int err = -EINTR;
+
+	if (thread_group_empty(tsk))
+		return 0;
+
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (!signal_group_exit(sig)) {
+		sig->group_exit_task = tsk;
+		sig->notify_count = -zap_other_threads(tsk);
+		err = 0;
+	}
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+
+	if (!err)
+		err = wait_for_notify_count(tsk, sig);
+	return err;
+
+}
+
 /*
- * This function makes sure the current process has its own signal table,
- * so that flush_signal_handlers can later reset the handlers without
- * disturbing other processes.  (Other processes might share the signal
- * table via the CLONE_SIGHAND option to clone().)
+ * This function makes sure the current process has no other threads and
+ * has a private signal table so that flush_signal_handlers() can reset
+ * the handlers without disturbing other processes which might share the
+ * signal table via the CLONE_SIGHAND option to clone().
  */
-static int de_thread(struct task_struct *tsk)
+int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
 
-	/*
-	 * Kill all other threads in the thread group.
-	 */
 	spin_lock_irq(lock);
-	if (signal_group_exit(sig)) {
-		/*
-		 * Another group action in progress, just
-		 * return so that the signal is processed.
-		 */
-		spin_unlock_irq(lock);
-		return -EAGAIN;
-	}
-
-	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
+	sig->notify_count = sig->nr_threads - 1;
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
-
-	while (sig->notify_count) {
-		__set_current_state(TASK_KILLABLE);
-		spin_unlock_irq(lock);
-		schedule();
-		if (unlikely(__fatal_signal_pending(tsk)))
-			goto killed;
-		spin_lock_irq(lock);
-	}
 	spin_unlock_irq(lock);
 
+	if (wait_for_notify_count(tsk, sig))
+		return -EINTR;
+
 	/*
 	 * At this point all other threads have exited, all we have to
-	 * do is to wait for the thread group leader to become inactive,
-	 * and to assume its PID:
+	 * do is to reap the old leader and assume its PID.
 	 */
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
-
+		threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 		/*
 		 * The only record we have of the real-time age of a
 		 * process, regardless of execs it's done, is start_time.
@@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+no_thread_group:
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
-
-no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
@@ -1198,16 +1210,10 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	BUG_ON(!thread_group_leader(tsk));
+	flush_signal_handlers(current, 0);
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
+EXPORT_SYMBOL(de_thread);
 
 char *get_task_comm(char *buf, struct task_struct *tsk)
 {
@@ -1237,11 +1243,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 {
 	int retval;
 
-	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
-	 */
-	retval = de_thread(current);
+	retval = kill_sub_threads(current);
 	if (retval)
 		goto out;
 
@@ -1336,7 +1338,6 @@ void setup_new_exec(struct linux_binprm * bprm)
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
 	current->self_exec_id++;
-	flush_signal_handlers(current, 0);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1303b57..06a5a7b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
+extern int de_thread(struct task_struct *tsk);
 extern void would_dump(struct linux_binprm *, struct file *);
 
 extern int suid_dumpable;
diff --git a/kernel/exit.c b/kernel/exit.c
index 8f14b86..169d9f2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
+	/* mt-exec, kill_sub_threads() is waiting for group exit */
+	if (unlikely(tsk->signal->notify_count < 0) &&
+	    !++tsk->signal->notify_count)
 		wake_up_process(tsk->signal->group_exit_task);
 	write_unlock_irq(&tasklist_lock);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 3603d93..b78ce63 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
-
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;
-- 
2.5.0

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 18:04   ` [PATCH V2 " Oleg Nesterov
@ 2017-02-16 11:42     ` Eric W. Biederman
  2017-02-20 15:22       ` Oleg Nesterov
  2017-02-17  4:42     ` Eric W. Biederman
  1 sibling, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-02-16 11:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel


I am slowly working my way through this and I have found something that
I have an issue with.

Oleg Nesterov <oleg@redhat.com> writes:

> - In any case we should limit the scope of cred_guard_mutex in the execve paths.
>   It is not clear why we take it at the start of execve, and worse, it is
>   not clear why we actually overload this mutex to exclude other threads
>   (except for check_unsafe_exec(), but this is solvable). The original motivation
>   was signal->in_exec_mm protection, but this idea was replaced by 3c77f8457221
>   ("exec: make argv/envp memory visible to oom-killer"). It is just ugly to
>   call copy_strings() etc with this mutex held.


The original changes that introduced cred_guard_mutex are:
a6f76f23d297 ("CRED: Make execve() take advantage of copy-on-write credentials")
d84f4f992cbd ("CRED: Inaugurate COW credentials")

So I don't think you actually have your history right.

Beyond that there is a compelling reason to have exec appear atomic from
the perspective of ptrace_attach.   If the operation is a setuid exec
and the tracer does not have permission to trace the original or the
result of the exec there could be some significant information leakage
if the exec operation is not atomic from the perspective of
ptrace_attach.

This is the kind of thing that could easily defeat address space layout
randomization of setuid executables if we get this wrong.  So I don't
think this is a small thing we are talking about.

And while I would like to merge this patch, a patch description that
appears to encourage people to make changes that will create security
vulnerabilities does not make me comfortable.


Additionally, your comment makes me nervous when you are wondering why we
take this mutex to exclude other threads, while I look in the git history
and see:

commit 9b1bf12d5d51bca178dea21b04a0805e29d60cf1
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Wed Oct 27 15:34:08 2010 -0700

    signals: move cred_guard_mutex from task_struct to signal_struct
    
    Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
    itself and we can reuse ->cred_guard_mutex for it.  Yes, concurrent
    execve() has no worth.
    
    Let's move ->cred_guard_mutex from task_struct to signal_struct.  It
    naturally prevent multiple-threads-inside-exec.
    
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Roland McGrath <roland@redhat.com>
    Acked-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

AKA it was your idea and you reviewed the patch that made it happen.

So while I fully agree we have issues here that we need to address and
fix, your patch description does not inspire confidence.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 18:04   ` [PATCH V2 " Oleg Nesterov
  2017-02-16 11:42     ` Eric W. Biederman
@ 2017-02-17  4:42     ` Eric W. Biederman
  2017-02-20 15:50       ` Oleg Nesterov
  1 sibling, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-02-17  4:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel,
	David Howells

Oleg Nesterov <oleg@redhat.com> writes:

> de_thread() waits for the other threads with ->cred_guard_mutex held, and this
> is really bad because the wait time is not bounded: a debugger can delay the exit,
> and this lock has a lot of users (mostly abusers, imo) in fs/proc and elsewhere.
> This leads to a deadlock if the debugger tries to take the same mutex:

Oleg.  I looked at the history in proc of users of cred_guard_mutex
and the proc users are grabbing cred_guard_mutex for the proper
semantic reasons.  To avoid races with setuid exec that could result in
an information disclosure.

I do agree that a mutex is the wrong data structure for the job
cred_guard_mutex is performing.  The job of sorting ensuring debuggers
and proc processes see either the old version or the new version of the
task.

I need to play with the code, but I suspect the best way we can handle this,
preventing both security issues and problems in the future, is to create
a new task struct and populate it appropriately with the new data from
exec (at least in the case of a setuid exec).

I am thinking of generalizing the case of a non-leader exec, where we
have to assume the leader's pid.

I don't yet know what the performance implications would be but
that would clean the users up a lot.

On that score I believe we can incrementally approach that point and
only grab the cred_guard_mutex in exec if we are performing an exec that
changes the processes credentials.

Right now I don't think it introduces any new security information
disclosures but the moving of flush_signal_handlers outside of
cred_guard_mutex feels wrong.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-16 11:42     ` Eric W. Biederman
@ 2017-02-20 15:22       ` Oleg Nesterov
  2017-02-20 15:36         ` Oleg Nesterov
  2017-02-20 22:30         ` Eric W. Biederman
  0 siblings, 2 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-20 15:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

Eric,

Thanks for looking into this! and sorry for delay.

On 02/17, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > - In any case we should limit the scope of cred_guard_mutex in the execve paths.
> >   It is not clear why we take it at the start of execve, and worse, it is
> >   not clear why we actually overload this mutex to exclude other threads
> >   (except for check_unsafe_exec(), but this is solvable). The original motivation
> >   was signal->in_exec_mm protection, but this idea was replaced by 3c77f8457221
> >   ("exec: make argv/envp memory visible to oom-killer"). It is just ugly to
> >   call copy_strings() etc with this mutex held.
>
>
> The original changes that introduced cred_guard_mutex are:
> a6f76f23d297 ("CRED: Make execve() take advantage of copy-on-write credentials")
> d84f4f992cbd ("CRED: Inaugurate COW credentials")
>
> So I don't think you actually have your history right.
>
> Beyond that there is a compelling reason to have exec appear atomic from
> the perspective of ptrace_attach.   If the operation is a setuid exec
> and the tracer does not have permission to trace the original or the
> result of the exec there could be some significant information leakage
> if the exec operation is not atomic from the perspective of
> ptrace_attach.

Yes sure.

But I meant that execve() should not take cred_guard_mutex at the start; it
should take it later even if we do not rework the security hooks. At least
it should take it after copy_strings(), but probably this needs some work.

> Additionally your comment makes me nervous when you are wondering why we
> take this mutex to exclude other threads and I look in the git history
> and see:
>
> commit 9b1bf12d5d51bca178dea21b04a0805e29d60cf1
> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date:   Wed Oct 27 15:34:08 2010 -0700
>
>     signals: move cred_guard_mutex from task_struct to signal_struct
>
>     Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
>     itself and we can reuse ->cred_guard_mutex for it.  Yes, concurrent
>     execve() has no worth.
>
>     Let's move ->cred_guard_mutex from task_struct to signal_struct.  It
>     naturally prevent multiple-threads-inside-exec.

Yes, and let me explain the original motivation for this change.

As a reminder, we had a problem with copy_strings(), which can use a lot of
memory, and this memory was not visible to the OOM-killer.

So we were going to add the new member,

	signal_struct->in_exec_mm = bprm->mm

and change OOM-killer to account both task->mm and task->signal->in_exec_mm.

And in this case we obviously need to ensure that only one thread
can enter exec and use signal_struct->in_exec_mm.

That patch was ready, but then we found another (better) solution:
3c77f8457221 ("exec: make argv/envp memory visible to oom-killer").

So I do not think we need to exclude other threads today, and we do
not need to hold cred_guard_mutex throughout the whole execve path.

Again, this needs some work. For example, check_unsafe_exec() assumes
it can't race with another thread; see 9e00cdb091b008cb3c78192651180
("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic").
But this looks solvable.


> So while I fully agree we have issues here that we need to address and
> fix, your patch description does not inspire confidence.

See above... what do you think I should change in this part of the changelog?

Thanks,

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-20 15:22       ` Oleg Nesterov
@ 2017-02-20 15:36         ` Oleg Nesterov
  2017-02-20 22:30         ` Eric W. Biederman
  1 sibling, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-20 15:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

On 02/20, Oleg Nesterov wrote:
>
> Again, this needs some work. For example, check_unsafe_exec() assumes
> it can't race with another thread; see 9e00cdb091b008cb3c78192651180
> ("exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic").
> But this looks solvable.

Forgot to mention... plus check_unsafe_exec() checks ->ptrace; this is
another reason why we can't simply shift mutex_lock(cred_guard_mutex)
later.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-17  4:42     ` Eric W. Biederman
@ 2017-02-20 15:50       ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-20 15:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel,
	David Howells

On 02/17, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > de_thread() waits for the other threads with ->cred_guard_mutex held, and this
> > is really bad because the wait time is not bounded: a debugger can delay the exit,
> > and this lock has a lot of users (mostly abusers, imo) in fs/proc and elsewhere.
> > This leads to a deadlock if the debugger tries to take the same mutex:
>
> Oleg.  I looked at the history in proc of users of cred_guard_mutex
> and the proc users are grabbing cred_guard_mutex for the proper
> semantic reasons.  To avoid races with setuid exec that could result in
> an information disclosure.

This is clear. However I really think it is over-used. For example, I do
think that lock_trace() should die.

OK, of course I can be wrong. And in any case this is almost off-topic
right now, lets discuss this separately.

> On that score I believe we can incrementally approach that point and
> only grab the cred_guard_mutex in exec if we are performing an exec that
> changes the processes credentials.

Maybe... but afaics this is more complicated because of the LSM hooks.
To me, one of the main problems is that an LSM hook can fail if the task is
traced; that is why we can't simply take cred_guard_mutex and check
->ptrace after de_thread(). But again, let's discuss this later.

> Right now I don't think it introduces any new security information
> disclosures but the moving of flush_signal_handlers outside of
> cred_guard_mutex feels wrong.

Why?

Just in case: we can do flush_signal_handlers() only after we unshare
->sighand; that is why it was moved outside of cred_guard_mutex.

But why is this bad? collect_sigign_sigcatch() is called without this
mutex; who else can look at sa.sa_handler?

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-20 15:22       ` Oleg Nesterov
  2017-02-20 15:36         ` Oleg Nesterov
@ 2017-02-20 22:30         ` Eric W. Biederman
  2017-02-21 17:53           ` Oleg Nesterov
  1 sibling, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-02-20 22:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel


Oleg,

My apologies for not replying in-line but I think I can be clearer just
saying what I think needs to be said.

Today cred_guard_mutex is part of making exec appear to be an atomic
operation to ptrace and proc.  To make exec appear to be atomic
we do need to take the mutex at the beginning and release it at the end
of exec.

The semantics of exec appearing atomic to ptrace_attach and to proc readers
are necessary to ensure we use the proper process credentials in the
event of a suid exec.

I believe making cred_guard_mutex per task is an option.  Reducing the
scope of cred_guard_mutex concerns me.  There appear to be some fields
like sighand that we currently expose in proc that might possibly be
problematic.  So I am not certain we protect enough things in our proc
accessors.

Do you know if we can make cred_guard_mutex a per-task lock again?

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-20 22:30         ` Eric W. Biederman
@ 2017-02-21 17:53           ` Oleg Nesterov
  2017-02-21 20:20             ` Eric W. Biederman
  0 siblings, 1 reply; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-21 17:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

On 02/21, Eric W. Biederman wrote:
>
> Today cred_guard_mutex is part of making exec appear to be an atomic
> operation to ptrace and proc.  To make exec appear to be atomic
> we do need to take the mutex at the beginning and release it at the end
> of exec.
>
> The semantics of exec appearing atomic to ptrace_attach and to proc readers
> are necessary to ensure we use the proper process credentials in the
> event of a suid exec.

This is clear. My point is that imo a) it is over-used in fs/proc, and b)
the scope of this mutex in execve is too huge. I see absolutely no reason
to do copy_strings() with this mutex held, for example. And note that
copy_strings() can use a lot of memory/time; it can trigger oom, swapping,
etc.

But let me repeat: this is a bit off-topic right now; this patch doesn't
change anything in this respect, afaics.


> I believe making cred_guard_mutex per task is an option.  Reducing the
> scope of cred_guard_mutex concerns me.  There appear to be some fields
> like sighand that we currently expose in proc

Please see my other email; collect_sigign_sigcatch() is called without this
mutex.

> Do you know if we can make cred_guard_mutex a per-task lock again?

I think we can, but this needs some (afaics simple) changes too.

But for what? Note that the problem fixed by this series won't go away
if we do this.


So what do you think about this series?

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-21 17:53           ` Oleg Nesterov
@ 2017-02-21 20:20             ` Eric W. Biederman
  2017-02-22 17:41               ` Oleg Nesterov
  0 siblings, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-02-21 20:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 02/21, Eric W. Biederman wrote:
>>
>> Today cred_guard_mutex is part of making exec appear to be an atomic
>> operation to ptrace and proc.  To make exec appear to be atomic
>> we do need to take the mutex at the beginning and release it at the end
>> of exec.
>>
>> The semantics of exec appearing atomic to ptrace_attach and to proc
>> readers are necessary to ensure we use the proper process credentials
>> in the event of a suid exec.
>
> This is clear. My point is that imo a) it is over-used in fs/proc and b)
> the scope of this mutex in execve is too broad. I see absolutely no reason
> to do copy_strings() with this mutex held, for example. And note that
> copy_strings() can use a lot of memory/time, it can trigger oom, swapping,
> etc.

I agree that we can do things like copy_strings that don't change the
data structures of the task before taking the cred_guard_mutex.

> But let me repeat, this is a bit off-topic right now, this patch doesn't
> change anything in this respect, afaics.
>
>
>> I believe making cred_guard_mutex per task is an option.  Reducing the
>> scope of cred_guard_mutex concerns me.  There appear to be some fields
>> like sighand that we currently expose in proc
>
> please see another email, collect_sigign_sigcatch() is called without this
> mutex.

I agree that it is called without the mutex.  It is not clear to me that
is the correct behavior.  It violates the fundamental property that
exec of a setuid executable should be an atomic operation.  I don't know
how much we care but it disturbs me that we can read something of a
process's signal handling state with the wrong credentials.

Adopting an implementation where we can never fix this apparent bug
really really disturbs me.

>> Do you know if we can make cred_guard_mutex a per-task lock again?
>
> I think we can, but this needs some (afaics simple) changes too.
>
> But for what? Note that the problem fixed by this series won't go away
> if we do this.

I believe it will if the other waiters use mutex_lock_killable.

> So what do you think about this series?

I like the second patch.  That seems clean and reasonable.

I really don't like the first patch.  It makes an information leak
a required detail of the implementation and as such possibly something
we can never change.  It attempts to paint a picture for a full fix in
the future that appears to result in an incorrect kernel.  That really
bugs me.

I suspect that a good fix that respects that proc and ptrace_attach need
to exclude the setuid exec case for semantic reasons would have a similar
complexity.

I think a mutex doing the job that cred_guard_mutex is doing, especially
when we have multiple readers and a single writer, is the wrong locking
primitive.  A reader-writer lock or something even cheaper would
probably be much better.

I think fixing the deadlock is important.

I think structuring the fix in such a way that the code is easily
maintainable in the future is also very important.

Right now it feels like your fix in patch 1 makes things a bit more
brittle and I don't like that at all.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-21 20:20             ` Eric W. Biederman
@ 2017-02-22 17:41               ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-22 17:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Mika Penttilä,
	Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

On 02/22, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> Reducing the
> >> scope of cred_guard_mutex concerns me.  There appear to be some fields
> >> like sighand that we currently expose in proc
> >
> > please see another email, collect_sigign_sigcatch() is called without this
> > mutex.
>
> I agree that it is called without the mutex.  It is not clear to me that
> is the correct behavior.

I fail to understand how/why this can be wrong.

> >> Do you know if we can make cred_guard_mutex a per-task lock again?
> >
> > I think we can, but this needs some (afaics simple) changes too.
> >
> > But for what? Note that the problem fixed by this series won't go away
> > if we do this.
>
> I believe it will if the other waiters use mutex_lock_killable.

No. They already use mutex_lock_killable/interruptible. And the test-case
can be killed, it is not a hard lockup.

> I really don't like the first patch.

Just in case, I don't really like it either. Simply because it makes execve
more complex, we need to wait for sub-threads twice.

> It makes an information leak
> a required detail of the implementation and as such possibly something
> we can never change.

Again, I simply can't understand how flush_signal_handlers() outside of
cred_guard_mutex can be treated as information leak. Even _if_
collect_sigign_sigcatch() was called with this mutex held.

Or do you mean something else?

> I suspect that a good fix that respects that proc and ptrace_attach need
> to exclude the setuid exec case for semantic reasons would have a similar
> complexity.

I am not sure I understand how we can do this. We need cred_guard_mutex
or something else even if exec is not setuid and does not change the
credentials, an LSM module can nack exec-under-ptrace for any reason.

> I think fixing the deadlock is important.

Yes. People actually hit this bug, it was reported several times.

> Right now it feels like your fix in patch 1 makes things a bit more
> brittle and I don't like that at all.

See above, I am not proud of this change either. I even mentioned in 0/2
that it would be nice to reconsider this change in the long term.

But I do not see another simple and _backportable_ solution for now.

What do you think we can do instead for stable trees?

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
  2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
  2017-02-13 14:15 ` [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
@ 2017-02-24 16:03 ` Oleg Nesterov
  2017-03-03  1:05   ` Eric W. Biederman
  2 siblings, 1 reply; 93+ messages in thread
From: Oleg Nesterov @ 2017-02-24 16:03 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman
  Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn,
	Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

Eric,

our discussion was a bit confusing, and it seems that we did not
fully convince each other. So let me ask what do you finally think
about this fix.

Let me repeat. Even if I do not agree with some of your objections,
I do agree that 1/2 does not look nice and clean. And we seem to
agree that either way, with or without this fix, we need more changes
in this area.

But we need a simple and backportable fix for stable trees, say for
rhel7. This bug was reported many times, and this is the simplest
solution I was able to find.

Oleg.


On 02/13, Oleg Nesterov wrote:
>
> Hello,
>
> Lets finally fix this problem, it was reported several times. I still think that
> in the longer term we should (try to) rework the security hooks and (partially)
> revert this change, but this is not trivial and we need something backportable
> anyway.
>
> Eric, Jann, we already discussed this change. 1/2 is the same patch I suggested 3
> months ago except now it compiles and moves flush_signal_handlers() to de_thread().
>
> Both patches ask for subsequent cleanups, see the changelogs.
>
> Oleg.
>
>  arch/x86/ia32/ia32_aout.c |   3 ++
>  fs/binfmt_aout.c          |   3 ++
>  fs/binfmt_elf.c           |   6 ++-
>  fs/binfmt_elf_fdpic.c     |   4 ++
>  fs/binfmt_flat.c          |   3 ++
>  fs/exec.c                 | 128 +++++++++++++++++++++++-----------------------
>  include/linux/binfmts.h   |   1 +
>  kernel/exit.c             |   5 +-
>  kernel/signal.c           |  21 +++++---
>  9 files changed, 101 insertions(+), 73 deletions(-)

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
  2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
@ 2017-03-03  1:05   ` Eric W. Biederman
  2017-03-03 17:33     ` Oleg Nesterov
  0 siblings, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03  1:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> Eric,
>
> our discussion was a bit confusing, and it seems that we did not
> fully convince each other. So let me ask what do you finally think
> about this fix.
>
> Let me repeat. Even if I do not agree with some of your objections,
> I do agree that 1/2 does not look nice and clean. And we seem to
> agree that either way, with or without this fix, we need more changes
> in this area.
>
> But we need a simple and backportable fix for stable trees, say for
> rhel7. This bug was reported many times, and this is the simplest
> solution I was able to find.

I am not 100% convinced that we need a backportable solution, but
regardless my experience is that good clean solutions are almost always
easier to backport than something we just settle for.

The patch below needs a little more looking and testing but arguably
it is sufficient.

It implements autoreaping for non-leader threads always.  We might want
to limit this to the case of exec.  For exec there is a strong argument
that no one cares, as the exit_code/status returned by these exiting
threads has always been 0.  Maybe that is ok (exit success), but it
certainly seems wrong in the case of exec.  It would really be better to
have an exit status saying that the thread went away because of exec.

I know it has always been 0 as fs/exec.c does not set
sig->group_exit_code so the value initialized in copy_signal from
kmem_cache_zalloc is used.  AKA 0.

This change has very little to do with the cred_guard_mutex.  Just a
question of which semantics we want to export.

In my preliminary testing this seems viable.

My big problem with your previous patch is that it is in no way
obviously correct.  Nor is it well explained.  All signs that someone is
pushing too hard and something better can be achieved if someone steps
back a little.

When the following patch can be made to solve the same problem we
certainly need an explanation in the commit comment on why we are using
much more code for the same job.

diff --git a/kernel/exit.c b/kernel/exit.c
index 5cfbd595f918..5a432e0dcf4a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 				thread_group_empty(tsk) &&
 				!ptrace_reparented(tsk) ?
 			tsk->exit_signal : SIGCHLD;
-		autoreap = do_notify_parent(tsk, sig);
+		do_notify_parent(tsk, sig);
+		/* Autoreap threads even when ptraced */
+		autoreap = !thread_group_leader(tsk);
 	} else if (thread_group_leader(tsk)) {
 		autoreap = thread_group_empty(tsk) &&
 			do_notify_parent(tsk, tsk->exit_signal);
@@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	}
 
 	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
-	if (tsk->exit_state == EXIT_DEAD)
-		list_add(&tsk->ptrace_entry, &dead);
 
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
@@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		list_del_init(&p->ptrace_entry);
 		release_task(p);
 	}
+	if (autoreap)
+		release_task(tsk);
 }
 
 #ifdef CONFIG_DEBUG_STACK_USAGE

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
  2017-03-03  1:05   ` Eric W. Biederman
@ 2017-03-03 17:33     ` Oleg Nesterov
  2017-03-03 18:23         ` Eric W. Biederman
  0 siblings, 1 reply; 93+ messages in thread
From: Oleg Nesterov @ 2017-03-03 17:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel

On 03/02, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > our discussion was a bit confusing, and it seems that we did not
> > fully convince each other. So let me ask what do you finally think
> > about this fix.
> >
> > Let me repeat. Even if I do not agree with some of your objections,
> > I do agree that 1/2 does not look nice and clean. And we seem to
> > agree that either way, with or without this fix, we need more changes
> > in this area.
> >
> > But we need a simple and backportable fix for stable trees, say for
> > rhel7. This bug was reported many times, and this is the simplest
> > solution I was able to find.
>
> I am not 100% convinced that we need a backportable solution,

I think we need one, this was already requested.

> but
> regardless my experience is that good clean solutions are almost always
> easier to backport that something we just settle for.

Sure.

> The patch below needs a little more looking and testing but arguably
> it is sufficient.
>
> It implements autoreaping for non-leader threads always.  We might want
> to limit this to the case of exec.

I should have mentioned this. Of course, this change was proposed from the
very beginning, when this problem was reported the first time. And of course
I like this fix much more than my patch (but yes, we should limit it to
the case of exec).

The only problem is that it is an incompatible change, and I have no idea
what can be broken.

Trivial test-case:

	void *thread(void *arg)
	{
		for (;;)
			pause();
		return NULL;
	}

	int main(void)
	{
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		execlp("true", "true", NULL);
	}

with your patch applied

	$ strace -f ./test
	strace: wait4(__WALL): No child processes

Yes, this is just a warning, but still. Now we need to change strace. And we
can't know what else can be confused/broken by this change.

man(ptrace) even documents that all other threads except the thread group leader
report death as if they exited via _exit(2).

Yes, yes, it also says that other threads stop in PTRACE_EVENT_EXIT stop,
so 2/2 (which we need with your change too) is not compatible either and
I am worried, but:

	- this was never really true, an already exiting thread won't stop
	  if it races with exec

	- PTRACE_O_TRACEEXEC is not used that often, it never really worked

	- man(ptrace) also mentions that PTRACE_EVENT_EXIT behaviour may
	  change in the future.

In short. Of course I considered this change. Looks too risky to me. But.
I will be happy if you send this change and take all the blame ;) I won't
argue (if you limit it to execve case).



> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  				thread_group_empty(tsk) &&
>  				!ptrace_reparented(tsk) ?
>  			tsk->exit_signal : SIGCHLD;
> -		autoreap = do_notify_parent(tsk, sig);
> +		do_notify_parent(tsk, sig);
> +		/* Autoreap threads even when ptraced */
> +		autoreap = !thread_group_leader(tsk);
>  	} else if (thread_group_leader(tsk)) {
>  		autoreap = thread_group_empty(tsk) &&
>  			do_notify_parent(tsk, tsk->exit_signal);

This is all you need,

> @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	}
>
>  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> -	if (tsk->exit_state == EXIT_DEAD)
> -		list_add(&tsk->ptrace_entry, &dead);
>
>  	/* mt-exec, de_thread() is waiting for group leader */
>  	if (unlikely(tsk->signal->notify_count < 0))
> @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  		list_del_init(&p->ptrace_entry);
>  		release_task(p);
>  	}
> +	if (autoreap)
> +		release_task(tsk);

These 2 changes are not needed. release_task(tsk) will be called by
list_for_each_entry_safe() above if autoreap == T.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
@ 2017-03-03 18:23         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03 18:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Cc'd linux-api as we are talking about a deliberate user visible API
change here.

Oleg Nesterov <oleg@redhat.com> writes:

> On 03/02, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > our discussion was a bit confusing, and it seems that we did not
>> > fully convince each other. So let me ask what do you finally think
>> > about this fix.
>> >
>> > Let me repeat. Even if I do not agree with some of your objections,
>> > I do agree that 1/2 does not look nice and clean. And we seem to
>> > agree that either way, with or without this fix, we need more changes
>> > in this area.
>> >
>> > But we need a simple and backportable fix for stable trees, say for
>> > rhel7. This bug was reported many times, and this is the simplest
>> > solution I was able to find.
>>
>> I am not 100% convinced that we need a backportable solution,
>
> I think we need, this was already requested,
>
>> but
>> regardless my experience is that good clean solutions are almost always
>> easier to backport that something we just settle for.
>
> Sure.
>
>> The patch below needs a little more looking and testing but arguably
>> it is sufficient.
>>
>> It implements autoreaping for non-leader threads always.  We might want
>> to limit this to the case of exec.
>
> I should have mentioned this. Of course, this change was proposed from the
> very beginning, when this problem was reported the first time. And of course
> I like this fix much more than my patch (but yes, we should limit it to
> the case of exec).
>
> The only problem is that it is an incompatible change, and I have no idea
> what can be broken.
>
> Trivial test-case:
>
> 	void *thread(void *arg)
> 	{
> 		for (;;)
> 			pause();
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		pthread_t pt;
> 		pthread_create(&pt, NULL, thread, NULL);
> 		execlp("true", "true", NULL);
> 	}
>
> with your patch applied
>
> 	$ strace -f ./test
> 	strace: wait4(__WALL): No child processes
>
> Yes, this is just a warning, but still. Now we need to change strace. And we
> can't know what else can be confused/broken by this change.
>
> man(ptrace) even documents that all other threads except the thread group leader
> report death as if they exited via _exit(2).
>
> Yes, yes, it also says that other threads stop in PTRACE_EVENT_EXIT stop,
> so 2/2 (which we need with your change too) is not compatible either and
> I am worried, but:
>
> 	- this was never really true, an already exiting thread won't stop
> 	  if it races with exec
>
> 	- PTRACE_O_TRACEEXEC is not used that often, it never really worked
>
> 	- man(ptrace) also mentions that PTRACE_EVENT_EXIT behaviour may
> 	  change in the future.
>
> In short. Of course I considered this change. Looks too risky to me. But.
> I will be happy if you send this change and take all the blame ;) I won't
> argue (if you limit it to execve case).

Unfortunately you have found what already looks like a userspace
regression.  So I don't think we can do that.

I do think the user space visible change of modifying a multi-threaded
exec not to wait for the other threads to be reaped is the least
dangerous change.

The big lesson for me, and what was not obvious from your change
description is that we are changing the user space visible semantics
of exec+ptrace and that cred_guard_mutex is not at all the problem (as
we always take cred_guard_mutex in a killable or interruptible way).

So I tentatively agree that we need to deliberately change the semantics
of exec to not wait for zombies in fs/exec.c:de_thread.  That is what I
would expect of exec and that seems to be the minimal change.

As part of that change I expect we want to call __cleanup_sighand from
do_exit rather than release_task.  Semantically we should not need the
sighand_struct for a zombie process.

>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>>  				thread_group_empty(tsk) &&
>>  				!ptrace_reparented(tsk) ?
>>  			tsk->exit_signal : SIGCHLD;
>> -		autoreap = do_notify_parent(tsk, sig);
>> +		do_notify_parent(tsk, sig);
>> +		/* Autoreap threads even when ptraced */
>> +		autoreap = !thread_group_leader(tsk);
>>  	} else if (thread_group_leader(tsk)) {
>>  		autoreap = thread_group_empty(tsk) &&
>>  			do_notify_parent(tsk, tsk->exit_signal);
>
> This is all you need,
>
>> @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>>  	}
>>
>>  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
>> -	if (tsk->exit_state == EXIT_DEAD)
>> -		list_add(&tsk->ptrace_entry, &dead);
>>
>>  	/* mt-exec, de_thread() is waiting for group leader */
>>  	if (unlikely(tsk->signal->notify_count < 0))
>> @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>>  		list_del_init(&p->ptrace_entry);
>>  		release_task(p);
>>  	}
>> +	if (autoreap)
>> +		release_task(tsk);
>
> These 2 changes are not needed. release_task(tsk) will be called by
> list_for_each_entry_safe() above if autoreap == T.

Except for the practical case that for threads that are ptraced
tsk->ptrace_entry is already in use.  Which means we can't use
list_add(&tsk->ptrace_entry, &dead).

Or in short we get a really pretty oops because we corrupt one of the
ptrace lists if we don't have my extra two hunks.  But I agree logically
they are separate changes.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
  2017-03-03 18:23         ` Eric W. Biederman
@ 2017-03-03 18:59           ` Eric W. Biederman
  -1 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03 18:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> The big lesson for me, and what was not obvious from your change
> description is that we are changing the user space visible semantics
> of exec+ptrace and that cred_guard_mutex is not at all the problem (as
> we always take cred_guard_mutex in a killable or interruptible way).

Just to follow up.

Because the cred_guard_mutex is fine as is we don't need to move
de_thread out from under cred_guard_mutex.  We just need to change
de_thread to wait until all of the other threads are zombies.
Which should remove about half your proposed patch.

The other key thing is that knowing it isn't cred_guard_mutex lets us
know that this kind of deadlock goes all of the way back to when
CLONE_THREAD was merged into the kernel.

Interestingly enough, looking at zap_other_threads and notify_count I
have found a second bug.  When a multi-threaded process becomes a
zombie we don't send the notification to the parent process until the
non-leader threads have been reaped.  Which means ptrace can mess up
sending SIGCHLD to the parent.

Now arguably that might be what is desirable but I don't think so.  If
we aren't ptracing a thread then I don't think we want to delay sending
SIGCHLD to the parent.

So this whole area of the semantics of a ptrace'd multi-threaded process
exiting/exec'ing looks like it needs a thorough going over.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
@ 2017-03-03 20:06             ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03 20:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> The big lesson for me, and what was not obvious from your change
>> description is that we are changing the user space visible semantics
>> of exec+ptrace and that cred_guard_mutex is not at all the problem (as
>> we always take cred_guard_mutex in a killable or interruptible way).
>
> Just to follow up.
>
> Because the cred_guard_mutex is fine as is we don't need to move
> de_thread out from under cred_guard_mutex.  We just need to change
> de_thread to wait until all of the other threads are zombies.
> Which should remove about half your proposed patch.
>
> The other key thing is that knowing it isn't cred_guard_mutex lets us
> know that this kind of deadlock goes all of the way back to when
> CLONE_THREAD was merged into the kernel.
>
> Interestingly enough, looking at zap_other_threads and notify_count I
> have found a second bug.  When a multi-threaded process becomes a
> zombie we don't send the notification to the parent process until the
> non-leader threads have been reaped.  Which means ptrace can mess up
> sending SIGCHLD to the parent.

Bah. I was misreading the code.  Nothing but exec uses notify_count
and group_exit_task.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-03-03 20:11               ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03 20:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Ever since CLONE_THREAD support was added to the kernel it has been
possible to deadlock userspace by ptracing a process and not reaping
its child threads.  With the use of cred_guard_mutex in proc, the ways
userspace can unknowingly trigger a deadlock have grown.  That makes
a simple "strace -f" on a program that performs a multi-threaded exec
unsafe.

        #include <unistd.h>
        #include <signal.h>
        #include <sys/ptrace.h>
        #include <pthread.h>

        void *thread(void *arg)
        {
                ptrace(PTRACE_TRACEME, 0,0,0);
                return NULL;
        }

        int main(void)
        {
                int pid = fork();

                if (!pid) {
                        pthread_t pt;
                        pthread_create(&pt, NULL, thread, NULL);
                        pthread_join(pt, NULL);
                        execlp("echo", "echo", "passed", NULL);
                }

                sleep(1);
                // or anything else which needs ->cred_guard_mutex,
                // say open(/proc/$pid/mem)
                ptrace(PTRACE_ATTACH, pid, 0,0);
                kill(pid, SIGCONT);

                return 0;
        }

Solve this by modifying exec to wait only until all of the other
threads are zombies, instead of waiting until they have been reaped.

As well as simplifying the userspace semantics this simplifies
the code.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Reported-by: Attila Fazekas <afazekas@redhat.com>
Reported-by: Aleksa Sarai <asarai@suse.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Fixes: v2.4.0
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c             | 31 +++++--------------------------
 include/linux/sched.h |  5 ++---
 kernel/exit.c         | 18 +++++-------------
 kernel/signal.c       |  6 +-----
 4 files changed, 13 insertions(+), 47 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 698a86094f76..f78205a72304 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
-	if (!thread_group_leader(tsk))
-		sig->notify_count--;
-
-	while (sig->notify_count) {
+	zap_other_threads(tsk);
+	while (atomic_read(&sig->live) > 1) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
 		schedule();
@@ -1081,29 +1078,13 @@ static int de_thread(struct task_struct *tsk)
 
 	/*
 	 * At this point all other threads have exited, all we have to
-	 * do is to wait for the thread group leader to become inactive,
-	 * and to assume its PID:
+	 * do is to assume the PID of the thread group leader.
 	 */
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
+		threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 
 		/*
 		 * The only record we have of the real-time age of a
@@ -1163,7 +1144,6 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
 
 no_thread_group:
 	/* we have changed execution domain */
@@ -1204,7 +1184,6 @@ static int de_thread(struct task_struct *tsk)
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
 	read_unlock(&tasklist_lock);
 	return -EAGAIN;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6261bfc12853..ff6aa76beac5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -711,11 +711,10 @@ struct signal_struct {
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
-	 * - notify group_exit_task when ->count is equal to notify_count
+	 * - notify group_exit_task when ->live is equal to 1
 	 * - everyone except group_exit_task is stopped during signal delivery
 	 *   of fatal signals, group_exit_task processes the signal.
 	 */
-	int			notify_count;
 	struct task_struct	*group_exit_task;
 
 	/* thread group stop support, overloads group_exit_code too */
@@ -2763,7 +2762,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
-extern int zap_other_threads(struct task_struct *p);
+extern void zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
 extern int send_sigqueue(struct sigqueue *,  struct task_struct *, int group);
diff --git a/kernel/exit.c b/kernel/exit.c
index 5cfbd595f918..dc9bc2bdb45a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -111,13 +111,6 @@ static void __exit_signal(struct task_struct *tsk)
 		tty = sig->tty;
 		sig->tty = NULL;
 	} else {
-		/*
-		 * If there is any task waiting for the group exit
-		 * then notify it:
-		 */
-		if (sig->notify_count > 0 && !--sig->notify_count)
-			wake_up_process(sig->group_exit_task);
-
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
 	}
@@ -701,10 +694,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
-
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
-		wake_up_process(tsk->signal->group_exit_task);
 	write_unlock_irq(&tasklist_lock);
 
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
@@ -740,7 +729,7 @@ static inline void check_stack_usage(void) {}
 void __noreturn do_exit(long code)
 {
 	struct task_struct *tsk = current;
-	int group_dead;
+	int group_left, group_dead;
 	TASKS_RCU(int tasks_rcu_i);
 
 	profile_task_exit(tsk);
@@ -809,7 +798,8 @@ void __noreturn do_exit(long code)
 	if (tsk->mm)
 		sync_mm_rss(tsk->mm);
 	acct_update_integrals(tsk);
-	group_dead = atomic_dec_and_test(&tsk->signal->live);
+	group_left = atomic_dec_return(&tsk->signal->live);
+	group_dead = group_left == 0;
 	if (group_dead) {
 #ifdef CONFIG_POSIX_TIMERS
 		hrtimer_cancel(&tsk->signal->real_timer);
@@ -818,6 +808,8 @@ void __noreturn do_exit(long code)
 		if (tsk->mm)
 			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
 	}
+	if ((group_left == 1) && tsk->signal->group_exit_task)
+		wake_up_process(tsk->signal->group_exit_task);
 	acct_collect(code, group_dead);
 	if (group_dead)
 		tty_audit_exit();
diff --git a/kernel/signal.c b/kernel/signal.c
index fbbab5a7c84d..17312c71f1ae 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1191,16 +1191,14 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 /*
  * Nuke all other threads in the group.
  */
-int zap_other_threads(struct task_struct *p)
+void zap_other_threads(struct task_struct *p)
 {
 	struct task_struct *t = p;
-	int count = 0;
 
 	p->signal->group_stop_count = 0;
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
@@ -1208,8 +1206,6 @@ int zap_other_threads(struct task_struct *p)
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
 	}
-
-	return count;
 }
 
 struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/2] fix the traced mt-exec deadlock
@ 2017-03-04 16:54           ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-03-04 16:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 03/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >>  	}
> >>
> >>  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> >> -	if (tsk->exit_state == EXIT_DEAD)
> >> -		list_add(&tsk->ptrace_entry, &dead);
> >>
> >>  	/* mt-exec, de_thread() is waiting for group leader */
> >>  	if (unlikely(tsk->signal->notify_count < 0))
> >> @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >>  		list_del_init(&p->ptrace_entry);
> >>  		release_task(p);
> >>  	}
> >> +	if (autoreap)
> >> +		release_task(tsk);
> >
> > These 2 changes are not needed. release_task(tsk) will be called by
> > list_for_each_entry_safe() above if autoreap == T.
>
> Except for the practical case that for threads that are ptraced
> tsk->ptrace_entry is already in use.  Which means we can't use
> list_add(&tsk->ptrace_entry, &dead).

Yes, I was wrong here, thanks for correcting me.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
  2017-03-03 20:11               ` Eric W. Biederman
  (?)
@ 2017-03-04 17:03               ` Oleg Nesterov
  2017-03-30  8:07                 ` Eric W. Biederman
  -1 siblings, 1 reply; 93+ messages in thread
From: Oleg Nesterov @ 2017-03-04 17:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 03/03, Eric W. Biederman wrote:
>
> Ever since CLONE_THREAD support was added to the kernel it has been
> possible to deadlock userspace by ptracing a process and not reaping
> its child threads.

Hmm. I disagree... I do not think this is a bug. But let's discuss this
separately, perhaps I misunderstood you.

> With the use of cred_guard_mutex in proc, the ways
> userspace can unknowingly trigger a deadlock have grown.

I think this particular problem did not exist until cred_guard_mutex
was introduced. A debugger can obviously "delay" exec if it doesn't
reap a zombie sub-thread, but that is another thing and not a bug imo.


> Solve this by modifying exec to wait only until all of the other
> threads are zombies, instead of waiting until they have been reaped.

This patch looks wrong in many ways.

> @@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk)
>  	}
>
>  	sig->group_exit_task = tsk;
> -	sig->notify_count = zap_other_threads(tsk);
> -	if (!thread_group_leader(tsk))
> -		sig->notify_count--;
> -
> -	while (sig->notify_count) {
> +	zap_other_threads(tsk);
> +	while (atomic_read(&sig->live) > 1) {
>  		__set_current_state(TASK_KILLABLE);
>  		spin_unlock_irq(lock);
>  		schedule();

Very nice. So de_thread() returns as soon as all other threads decrement
signal->live in do_exit(). Before they do, say, exit_mm(). This is already
wrong, for example this breaks OOM. Plus a lot more problems afaics, but
let's ignore this.

Note that de_thread() also unshares ->sighand before return. So in the
case of mt exec it will likely see oldsighand->count != 1 and alloc the
new sighand_struct and this breaks the locking.

Because the execing thread will use newsighand->siglock to protect its
signal_struct while the zombie threads will use oldsighand->siglock to
protect the same signal struct. Yes, tasklist_lock + the fact irq_disable
implies rcu_lock mostly save us but not entirely, say, a foreign process
doing __send_signal() can take the right or the wrong lock depending on
/dev/random.


> @@ -818,6 +808,8 @@ void __noreturn do_exit(long code)
>  		if (tsk->mm)
>  			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
>  	}
> +	if ((group_left == 1) && tsk->signal->group_exit_task)
> +		wake_up_process(tsk->signal->group_exit_task);

This is racy, but this is minor.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
  2017-03-04 17:03               ` Oleg Nesterov
@ 2017-03-30  8:07                 ` Eric W. Biederman
  2017-04-01  5:11                     ` Eric W. Biederman
  2017-04-02 16:15                     ` Oleg Nesterov
  0 siblings, 2 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-30  8:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 03/03, Eric W. Biederman wrote:
>> @@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk)
>>  	}
>>
>>  	sig->group_exit_task = tsk;
>> -	sig->notify_count = zap_other_threads(tsk);
>> -	if (!thread_group_leader(tsk))
>> -		sig->notify_count--;
>> -
>> -	while (sig->notify_count) {
>> +	zap_other_threads(tsk);
>> +	while (atomic_read(&sig->live) > 1) {
>>  		__set_current_state(TASK_KILLABLE);
>>  		spin_unlock_irq(lock);
>>  		schedule();
>
> Very nice. So de_thread() returns as soon as all other threads decrement
> signal->live in do_exit(). Before they do, say, exit_mm(). This is already
> wrong, for example this breaks OOM. Plus a lot more problems afaics,  but
> lets ignore this.

Which means that we need to keep sig->notify_count.

> Note that de_thread() also unshares ->sighand before return. So in the
> case of mt exec it will likely see oldsighand->count != 1 and alloc the
> new sighand_struct and this breaks the locking.
>
> Because the execing thread will use newsighand->siglock to protect its
> signal_struct while the zombie threads will use oldsighand->siglock to
> protect the same signal struct. Yes, tasklist_lock + the fact irq_disable
> implies rcu_lock mostly save us but not entirely, say, a foreign process
> doing __send_signal() can take the right or the wrong lock depending on
> /dev/random.

Which leads to the question: how can we get back to the 2.4 behavior
of freeing sighand_struct in do_exit?

At which point, as soon as we free sighand_struct, if we are the last
dying thread we notify de_thread() and everything works.

There are only two uses of sighand->siglock that I can see in
release_task:  __ptrace_unlink and __exit_signal.

For what __ptrace_unlink is doing we should just be able to skip
acquiring of siglock if PF_EXITING is set.

__exit_signal is a little more interesting but half of what it is
doing looks like it was pulled out of do_exit and just needs to
be put back.

Which probably adds up to 4 or 5 small carefully written patches to sort
out that part of the exit path, but I think it leads to a very simple
and straight forward result.  With the only side effect that we
can exec a process and still have a debugger reaping zombie threads.

Oleg does that sound reasonable to you?

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH 0/2] exec: Fixing ptrace'd multi-threaded hang
@ 2017-04-01  5:11                     ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01  5:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


I spent a little more time with this and only waiting until the killed
threads are zombies (and not reaped as we do today) really looks like
the right fix.

Oleg, the following two patches work on top of your PTRACE_EVENT_EXIT
change and probably need a little more cleanup until they are ready
for serious posting.

That said, I want to post the code so I have a chance at some
feedback before I prepare the final round of patches.

These patches only handle the case when sighand_struct is not
shared between different multi-threaded processes.  The general
case is solvable but that is quite a bit more code.

Eric W. Biederman (2):
      sighand: Count each thread group once in sighand_struct
      exec: If possible don't wait for ptraced threads to be reaped

 fs/exec.c                    | 15 ++++++++++-----
 include/linux/sched/signal.h |  2 +-
 kernel/exit.c                | 15 ++++++++++-----
 kernel/fork.c                |  6 ++++--
 kernel/signal.c              |  8 ++++++--
 5 files changed, 31 insertions(+), 15 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH 1/2] sighand: Count each thread group once in sighand_struct
@ 2017-04-01  5:14                       ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01  5:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


In practice a thread group either is using a sighand_struct or
it isn't.  Therefore simplify things a bit and only increment the
count in sighand_struct when a new thread group is created that uses
the existing sighand_struct, and only decrement the count in
sighand_struct when a thread group exits.

As well as standing on its own merits, this has the potential to simplify
de_thread.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c | 2 +-
 kernel/fork.c | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index e126ebf2400c..8c5b3e106298 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -163,9 +163,9 @@ static void __exit_signal(struct task_struct *tsk)
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
-	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
+		__cleanup_sighand(sighand);
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..fe6f1bf32bb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1295,7 +1295,8 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 	struct sighand_struct *sig;
 
 	if (clone_flags & CLONE_SIGHAND) {
-		atomic_inc(&current->sighand->count);
+		if (!(clone_flags & CLONE_THREAD))
+			atomic_inc(&current->sighand->count);
 		return 0;
 	}
 	sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
@@ -1896,7 +1897,8 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
 bad_fork_cleanup_sighand:
-	__cleanup_sighand(p->sighand);
+	if (!(clone_flags & CLONE_THREAD))
+		__cleanup_sighand(p->sighand);
 bad_fork_cleanup_fs:
 	exit_fs(p); /* blocking */
 bad_fork_cleanup_files:
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
@ 2017-04-01  5:16                       ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01  5:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Take advantage of the situation when sighand->count == 1 to only wait
for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD in de_thread.
Only very old Linux threading libraries use CLONE_SIGHAND without
CLONE_THREAD, so this situation should hold most of the time.

This allows ptracing through a multi-threaded exec without the danger
of stalling the exec.  Historically exec waits in de_thread for the
other threads to be reaped before completing.  That wait is necessary
because it is not safe to unshare the sighand_struct until all of the
other threads in the thread group are reaped: the lock used to
serialize threads in a thread group, siglock, lives in sighand_struct.

When oldsighand->count == 1 we know that there are no other
users, so unsharing the sighand_struct in exec is pointless.
This makes it safe to wait only for threads to become zombies,
as the siglock won't change during exec and release_task
will use the same siglock for the old threads as for
the new threads.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 15 ++++++++++-----
 include/linux/sched/signal.h |  2 +-
 kernel/exit.c                | 13 +++++++++----
 kernel/signal.c              |  8 ++++++--
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 65145a3df065..0fd29342bbe4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk)
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	bool may_hang;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk)
 		return -EAGAIN;
 	}
 
+	may_hang = atomic_read(&oldsighand->count) != 1;
 	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
-	if (!thread_group_leader(tsk))
+	sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1);
+	if (may_hang && !thread_group_leader(tsk))
 		sig->notify_count--;
 
 	while (sig->notify_count) {
@@ -1092,9 +1094,10 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			cgroup_threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
+		cgroup_threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
+
+		for (;may_hang;) {
 			/*
 			 * Do this under tasklist_lock to ensure that
 			 * exit_notify() can't miss ->group_exit_task
@@ -1108,6 +1111,8 @@ static int de_thread(struct task_struct *tsk)
 			schedule();
 			if (unlikely(__fatal_signal_pending(tsk)))
 				goto killed;
+			cgroup_threadgroup_change_begin(tsk);
+			write_lock_irq(&tasklist_lock);
 		}
 
 		/*
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2cf446704cd4..187a9e980d3a 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -298,7 +298,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
-extern int zap_other_threads(struct task_struct *p);
+extern int zap_other_threads(struct task_struct *p, int do_count);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
 extern int send_sigqueue(struct sigqueue *,  struct task_struct *, int group);
diff --git a/kernel/exit.c b/kernel/exit.c
index 8c5b3e106298..972df5ebf79f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -712,6 +712,8 @@ static void forget_original_parent(struct task_struct *father,
  */
 static void exit_notify(struct task_struct *tsk, int group_dead)
 {
+	struct sighand_struct *sighand = tsk->sighand;
+	struct signal_struct *signal = tsk->signal;
 	bool autoreap;
 	struct task_struct *p, *n;
 	LIST_HEAD(dead);
@@ -739,9 +741,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
-		wake_up_process(tsk->signal->group_exit_task);
+	spin_lock(&sighand->siglock);
+	/* mt-exec, de_thread is waiting for threads to exit */
+	if (signal->notify_count < 0 && !++signal->notify_count)
+		wake_up_process(signal->group_exit_task);
+
+	spin_unlock(&sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
@@ -975,7 +980,7 @@ do_group_exit(int exit_code)
 		else {
 			sig->group_exit_code = exit_code;
 			sig->flags = SIGNAL_GROUP_EXIT;
-			zap_other_threads(current);
+			zap_other_threads(current, 0);
 		}
 		spin_unlock_irq(&sighand->siglock);
 	}
diff --git a/kernel/signal.c b/kernel/signal.c
index 986ef55641ea..e3a5bc239345 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1196,7 +1196,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 /*
  * Nuke all other threads in the group.
  */
-int zap_other_threads(struct task_struct *p)
+int zap_other_threads(struct task_struct *p, int do_count)
 {
 	struct task_struct *t = p;
 	int count = 0;
@@ -1205,13 +1205,17 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
+		if (do_count > 0)
+			count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+
+		if (do_count < 0)
+			count--;
 	}
 
 	return count;
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
@ 2017-04-02 15:35                         ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-02 15:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/01, Eric W. Biederman wrote:
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk)
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	bool may_hang;
>
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk)
>  		return -EAGAIN;
>  	}
>
> +	may_hang = atomic_read(&oldsighand->count) != 1;
>  	sig->group_exit_task = tsk;
> -	sig->notify_count = zap_other_threads(tsk);
> -	if (!thread_group_leader(tsk))
> +	sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1);

Eric, this is amazing. So with this patch exec does different things depending
on whether sighand is shared with another CLONE_SIGHAND task or not. To me
this doesn't look sane in any case.

And of course you do realize that it doesn't solve the problem entirely? If I
modify my test-case a little bit

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <unistd.h>
	#include <pthread.h>
	#include <sys/ptrace.h>

	int xxx(void *arg)
	{
		for (;;)
			pause();
	}

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			char stack[16 * 1024];

			clone(xxx, stack + 16*1024, CLONE_SIGHAND|CLONE_VM, NULL);

			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

it should deadlock the same way?

So what is the point of making this, well, imo insane patch if it doesn't solve
the problem?

And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or
exit_notify() should set tsk->exit_state under siglock, otherwise zap() can
return the wrong count.

Finally, this patch creates a nice security hole. Let me modify my test-case
again:

	#include <pthread.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/ptrace.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp(path-to-setuid-binary, args);
		}

		sleep(1);

		// Now we can send signals to the setuid app
		kill(pid+1, ANYSIGNAL);

		return 0;
	}

I see another email from you with another proposal. I disagree; will reply soon.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 0/2] exec: Fixing ptrace'd mulit-threaded hang
@ 2017-04-02 15:38                       ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-02 15:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/01, Eric W. Biederman wrote:
>
> These patches only handle the case when sighand_struct is not
> shared between different multi-threaded processes.

Ah, I didn't read 0/2 when I looked at 2/2, so at least you documented this.

> The general
> case is solvable but that is quite a bit more code.

I don't think so, please see my reply to 2/2.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-04-02 16:15                     ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-02 16:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 03/30, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Very nice. So de_thread() returns as soon as all other threads decrement
> > signal->live in do_exit(). Before they do, say, exit_mm(). This is already
> > wrong, for example this breaks OOM. Plus a lot more problems afaics, but
> > let's ignore this.
>
> Which means that we need to keep sig->notify_count.

Yes, although we need to make it less ugly.

> > Note that de_thread() also unshares ->sighand before return. So in the
> > case of mt exec it will likely see oldsighand->count != 1 and alloc the
> > new sighand_struct and this breaks the locking.
> >
> > Because the execing thread will use newsighand->siglock to protect its
> > signal_struct while the zombie threads will use oldsighand->siglock to
> > protect the same signal struct. Yes, tasklist_lock + the fact irq_disable
> > implies rcu_lock mostly save us but not entirely, say, a foreign process
> > doing __send_signal() can take the right or the wrong lock depending on
> > /dev/random.
>
> Which leads to the question: how can we get back to the 2.4 behavior
> of freeing sighand_struct in do_exit?
>
> At which point, as soon as we free sighand_struct, if we are the last
> dying thread we notify de_thread and everything works.

I was thinking about a similar option, see below, but decided that we
should not do this, at least not right now.

> For what __ptrace_unlink is doing we should just be able to skip
> acquiring of siglock if PF_EXITING is set.

We can even remove it from release_task() path, this is simple.

> __exit_signal is a little more interesting but half of what it is
> doing looks like it was pulled out of do_exit and just needs to
> be put back.

That is. I think we should actually unhash the exiting sub-thread even
if it is traced. IOW, remove it from thread/pid/parent/etc lists and
nullify its ->sighand. IMO, whatever we do thread_group_empty(current)
should be true after exec.

So the exiting sub-thread should look almost like an EXIT_DEAD task, except it
still should report to the debugger.

But this is dangerous. Say, wait4(upid <= 0) becomes unsafe because
task_pid_type(PIDTYPE_PGID) won't work.

> Which probably adds up to 4 or 5 small carefully written patches to sort
> out that part of the exit path,

Perhaps I am wrong, but I think you underestimate the problems, and it is
not clear to me if we really want this.

=========================================================================
Anyway, Eric, even if we can and want to do this, why we can't do this on
top of my fix?

I simply fail to understand why you dislike it that much. Yes it is not
pretty, I said this many times, but it is safe in that it doesn't really
change the current behaviour.

I am much more worried about 2/2, which you didn't argue with; this patch _can_
break something, and this is obviously not good even if PTRACE_EVENT_EXIT
was always broken.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
@ 2017-04-02 18:53                           ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 18:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/01, Eric W. Biederman wrote:
>>
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk)
>>  	struct signal_struct *sig = tsk->signal;
>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>  	spinlock_t *lock = &oldsighand->siglock;
>> +	bool may_hang;
>>
>>  	if (thread_group_empty(tsk))
>>  		goto no_thread_group;
>> @@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk)
>>  		return -EAGAIN;
>>  	}
>>
>> +	may_hang = atomic_read(&oldsighand->count) != 1;
>>  	sig->group_exit_task = tsk;
>> -	sig->notify_count = zap_other_threads(tsk);
>> -	if (!thread_group_leader(tsk))
>> +	sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1);
>
> Eric, this is amazing. So with this patch exec does different things depending
> on whether sighand is shared with another CLONE_SIGHAND task or not. To me
> this doesn't look sane in any case.

It is a 99% solution that makes it possible to talk about and review
letting the exec continue after the subthreads are killed but not
reaped.

Sigh, I should have made may_hang say:

may_hang = (atomic_read(&oldsighand->count) != 1) && (sig->nr_threads > 1)

Which covers all known ways userspace actually uses these clone flags.

> And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or
> exit_notify() should set tsk->exit_state under siglock, otherwise zap() can
> return the wrong count.

zap_other_threads(tsk, 0) only gets called in the case where we don't
care about the return value.  It does not get called from fs/exec.c.

> Finally. This patch creates the nice security hole. Let me modify my test-case
> again:
>
> 	void *thread(void *arg)
> 	{
> 		ptrace(PTRACE_TRACEME, 0,0,0);
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		int pid = fork();
>
> 		if (!pid) {
> 			pthread_t pt;
> 			pthread_create(&pt, NULL, thread, NULL);
> 			pthread_join(pt, NULL);
> 			execlp(path-to-setuid-binary, args);
> 		}
>
> 		sleep(1);
>
> 		// Now we can send signals to the setuid app
> 		kill(pid+1, ANYSIGNAL);
>
> 		return 0;
> 	}

That is a substantive objection, and something that definitely needs
to get fixed.   Can you think of anything else?

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-04-02 21:07                       ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 21:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> Perhaps I am wrong, but I think you underestimate the problems, and it is
> not clear to me if we really want this.

I worked through quite a bit of it and I realized a few fundamental
issues.  The task struct must remain visible until it is reaped,
and we use siglock in __unhash_process to protect that reaping.
Further, tsk->sighand == NULL winds up being a flag used to tell
if release_task has been called.

To get a usable count on sighand_struct, all that needed to be done
was to change the reference counting of sighand_struct to count
processes and not threads.

Which is what I wound up posting.

> Anyway, Eric, even if we can and want to do this, why we can't do this on
> top of my fix?

Because your reduction in scope of cred_guard_mutex is fundamentally
broken and unnecessary.

> I simply fail to understand why you dislike it that much. Yes it is not
> pretty, I said this many times, but it is safe in that it doesn't really
> change the current behaviour.

No it is not safe.  And it promotes wrong thinking which is even more
dangerous.

I reviewed the code and cred_guard_mutex needs to cover what it covers.

> I am much more worried about 2/2 you didn't argue with, this patch _can_
> break something and this is obviously not good even if PTRACE_EVENT_EXIT
> was always broken.

I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
know what the implications of changing it are.  Let's see...

gdb - no
upstart - no
lldb - yes
strace - no


It looks like lldb is worth testing with your PTRACE_EVENT_EXIT change
to see if anything breaks.

I think we can get away with changing the exec case but it does look
worth testing.  I hadn't realized you hadn't looked to see what was
using PTRACE_O_TRACEEXIT to see if any part of userspace cares.

Hmm.  This is interesting.  From the strace documentation:
> Tracer cannot assume that ptrace-stopped tracee exists. There are many
> scenarios when tracee may die while stopped (such as SIGKILL).
> Therefore, tracer must always be prepared to handle ESRCH error on any
> ptrace operation. Unfortunately, the same error is returned if tracee
> exists but is not ptrace-stopped (for commands which require stopped
> tracee), or if it is not traced by process which issued ptrace call.
> Tracer needs to keep track of stopped/running state, and interpret
> ESRCH as "tracee died unexpectedly" only if it knows that tracee has
> been observed to enter ptrace-stop. Note that there is no guarantee
> that waitpid(WNOHANG) will reliably report tracee's death status if
> ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead.
> IOW: tracee may be "not yet fully dead" but already refusing ptrace
> ops.

If delivering a second SIGKILL to a ptrace-stopped process will
make it continue, we have a very interesting out...

When we stop in ptrace_stop we stop in TASK_TRACED == (TASK_WAKEKILL|__TASK_TRACED)

Delivery of a SIGKILL to that task queues SIGKILL
and calls signal_wake_up_state(t, TASK_WAKEKILL),
which becomes wake_up_state(t, TASK_INTERRUPTIBLE | TASK_WAKEKILL),
which wakes up the process.

So userspace can absolutely kill a process in PTRACE_EVENT_EXIT
before the tracer finds it.

Therefore we are only talking about a quality-of-implementation issue:
whether we actually stop and wait for the tracer or not.

....

Which brings us to your PTRACE_EVENT_EXIT patch.

I think may_ptrace_stop is tested in the wrong place,
and is probably buggy.

- We should send the signal in all cases except when the ptracing parent
  does not exist aka (!current->ptrace).  The siginfo contains enough
  information to understand what happened if anyone is listening.

- Then we should send the group stop.

- Then if we don't want to wait we should:
  __set_current_state(TASK_RUNNING)

- Then we should drop the locks and only call freezable_schedule if
  we want to wait.

That way userspace thinks someone else just sent a SIGKILL and killed
the thread before it had a chance to look (which is effectively what
we are doing).  That sounds ideal for both core dumps and this case.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 0/5] exec: Fixing ptrace'd multi-threaded hang
@ 2017-04-02 22:50                       ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Oleg, your comment about kill being able to send a signal was an important
dimension I had missed, thank you.

This patchset just denies the case of a shared sighand (CLONE_SIGHAND)
between different multi-threaded processes, as I don't think anyone cares.
I can fix that if anyone cares but I am not certain anyone actually does.

I have reworked the ptrace notification code so that we always
send notifications if we can, but don't wait if it is a coredump
or an exec, which simplifies the code nicely.

A few more tweaks are needed before a final version but I think
things are compelling.

 fs/exec.c                    |  23 ++-------
 include/linux/sched/signal.h |   1 +
 kernel/exit.c                |  20 ++++----
 kernel/fork.c                |  14 +++++-
 kernel/ptrace.c              |   4 ++
 kernel/signal.c              | 112 +++++++++++++++++++------------------------
 6 files changed, 78 insertions(+), 96 deletions(-)

Eric W. Biederman (5):
      ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump
      sighand: Count each thread group once in sighand_struct
      clone: Disallow CLONE_THREAD with a shared sighand_struct
      exec: If possible don't wait for ptraced threads to be reaped
      signal: Don't allow accessing signal_struct by old threads after exec

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump
@ 2017-04-02 22:51                         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Take advantage of the fact that no ptrace stop will wait for
a debugger if another process sends SIGKILL to the waiting task.

In the case of exec and coredump, which have many interesting deadlock
opportunities and where no one tests what happens if you are ptraced,
act as if another SIGKILL was immediately sent to the process and
don't stop.

Keep sending the signal to the tracer so that this appears like
the worst case where someone else sent the process a SIGKILL before
the tracer could react.  So all non-buggy tracers must support
this case.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/signal.c | 103 +++++++++++++++++++++++---------------------------------
 1 file changed, 42 insertions(+), 61 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 7e59ebc2c25e..11fa736eb2ae 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1740,30 +1740,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
-static inline int may_ptrace_stop(void)
-{
-	if (!likely(current->ptrace))
-		return 0;
-	/*
-	 * Are we in the middle of do_coredump?
-	 * If so and our tracer is also part of the coredump stopping
-	 * is a deadlock situation, and pointless because our tracer
-	 * is dead so don't allow us to stop.
-	 * If SIGKILL was already sent before the caller unlocked
-	 * ->siglock we must see ->core_state != NULL. Otherwise it
-	 * is safe to enter schedule().
-	 *
-	 * This is almost outdated, a task with the pending SIGKILL can't
-	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
-	 * after SIGKILL was already dequeued.
-	 */
-	if (unlikely(current->mm->core_state) &&
-	    unlikely(current->mm == current->parent->mm))
-		return 0;
-
-	return 1;
-}
-
 /*
  * Return non-zero if there is a SIGKILL that should be waking us up.
  * Called with the siglock held.
@@ -1789,6 +1765,8 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 	__releases(&current->sighand->siglock)
 	__acquires(&current->sighand->siglock)
 {
+	bool ptrace_parent;
+	bool ptrace_wait;
 	bool gstop_done = false;
 
 	if (arch_ptrace_stop_needed(exit_code, info)) {
@@ -1842,53 +1820,56 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 
 	spin_unlock_irq(&current->sighand->siglock);
 	read_lock(&tasklist_lock);
-	if (may_ptrace_stop()) {
-		/*
-		 * Notify parents of the stop.
-		 *
-		 * While ptraced, there are two parents - the ptracer and
-		 * the real_parent of the group_leader.  The ptracer should
-		 * know about every stop while the real parent is only
-		 * interested in the completion of group stop.  The states
-		 * for the two don't interact with each other.  Notify
-		 * separately unless they're gonna be duplicates.
-		 */
+	ptrace_parent = likely(current->ptrace);
+	/*
+	 * Notify parents of the stop.
+	 *
+	 * While ptraced, there are two parents - the ptracer and
+	 * the real_parent of the group_leader.  The ptracer should
+	 * know about every stop while the real parent is only
+	 * interested in the completion of group stop.  The states
+	 * for the two don't interact with each other.  Notify
+	 * separately unless they're gonna be duplicates.
+	 */
+	if (ptrace_parent)
 		do_notify_parent_cldstop(current, true, why);
-		if (gstop_done && ptrace_reparented(current))
-			do_notify_parent_cldstop(current, false, why);
+	if (gstop_done && (ptrace_reparented(current) || !ptrace_parent))
+		do_notify_parent_cldstop(current, false, why);
 
-		/*
-		 * Don't want to allow preemption here, because
-		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
-		 */
-		preempt_disable();
-		read_unlock(&tasklist_lock);
-		preempt_enable_no_resched();
-		freezable_schedule();
-	} else {
-		/*
-		 * By the time we got the lock, our tracer went away.
-		 * Don't drop the lock yet, another tracer may come.
-		 *
-		 * If @gstop_done, the ptracer went away between group stop
-		 * completion and here.  During detach, it would have set
-		 * JOBCTL_STOP_PENDING on us and we'll re-enter
-		 * TASK_STOPPED in do_signal_stop() on return, so notifying
-		 * the real parent of the group stop completion is enough.
-		 */
-		if (gstop_done)
-			do_notify_parent_cldstop(current, false, why);
+	/*
+	 * Always wait for the tracer if the process is traced unless
+	 * the process is in the middle of an exec or core dump.
+	 *
+	 * In exec we risk a deadlock with de_thread waiting for the
+	 * thread to be killed, the tracer, and acquiring cred_guard_mutex.
+	 *
+	 * In coredump we risk a deadlock if the tracer is also part
+	 * of the coredump stopping.
+	 */
+	ptrace_wait = ptrace_parent &&
+		!current->signal->group_exit_task &&
+		!current->mm->core_state;
 
+	if (!ptrace_wait) {
 		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		if (clear_code)
 			current->exit_code = 0;
-		read_unlock(&tasklist_lock);
 	}
 
 	/*
+	 * Don't want to allow preemption here, because
+	 * sys_ptrace() needs this task to be inactive.
+	 *
+	 * XXX: implement read_unlock_no_resched().
+	 */
+	preempt_disable();
+	read_unlock(&tasklist_lock);
+	preempt_enable_no_resched();
+	if (ptrace_wait)
+		freezable_schedule();
+
+	/*
 	 * We are back.  Now reacquire the siglock before touching
 	 * last_siginfo, so that we are sure to have synchronized with
 	 * any signal-sending on another CPU that wants to examine it.
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 2/5] sighand: Count each thread group once in sighand_struct
@ 2017-04-02 22:51                         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


In practice a thread group is either using a sighand_struct or
it isn't.  Therefore simplify things a bit and only increment the
count in sighand_struct when a new thread group is created that uses
the existing sighand_struct, and only decrement the count in
sighand_struct when a thread group exits.

As well as standing on its own merits, this has the potential to
simplify de_thread.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c | 2 +-
 kernel/fork.c | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index e126ebf2400c..8c5b3e106298 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -163,9 +163,9 @@ static void __exit_signal(struct task_struct *tsk)
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
-	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
+		__cleanup_sighand(sighand);
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..fe6f1bf32bb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1295,7 +1295,8 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 	struct sighand_struct *sig;
 
 	if (clone_flags & CLONE_SIGHAND) {
-		atomic_inc(&current->sighand->count);
+		if (!(clone_flags & CLONE_THREAD))
+			atomic_inc(&current->sighand->count);
 		return 0;
 	}
 	sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
@@ -1896,7 +1897,8 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
 bad_fork_cleanup_sighand:
-	__cleanup_sighand(p->sighand);
+	if (!(clone_flags & CLONE_THREAD))
+		__cleanup_sighand(p->sighand);
 bad_fork_cleanup_fs:
 	exit_fs(p); /* blocking */
 bad_fork_cleanup_files:
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct
@ 2017-04-02 22:52                         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Old threading libraries used CLONE_SIGHAND without CLONE_THREAD.
Modern threading libraries always use CLONE_SIGHAND | CLONE_THREAD.

Therefore let's simplify our lives and stop supporting a case no one cares about.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index fe6f1bf32bb9..0632ac1180be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process(
 	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
 		return ERR_PTR(-EINVAL);
 
+	/* Disallow CLONE_THREAD with a shared SIGHAND structure.  No
+	 * one cares and supporting it leads to unnecessarily complex
+	 * code.
+	 */
+	if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Shared signal handlers imply shared VM. By way of the above,
 	 * thread groups also imply shared VM. Blocking this case allows
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread
^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 4/5] exec: If possible don't wait for ptraced threads to be reaped
@ 2017-04-02 22:53                         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Take advantage of the situation when sighand->count == 1 to only wait
for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD in de_thread.
Only very old Linux threading libraries use CLONE_SIGHAND without
CLONE_THREAD, so this situation should hold most of the time.

This allows ptracing through a multi-threaded exec without the danger
of stalling the exec, as historically exec waits in de_thread for the
other threads to be reaped before completing.  That wait is necessary
because it is not safe to unshare the sighand_struct until all of the
other threads in the thread group are reaped: the lock that serializes
threads in a thread group, siglock, lives in sighand_struct.

When oldsighand->count == 1 we know that there are no other
users and unsharing the sighand_struct in exec is pointless.
This makes it safe to only wait for threads to become zombies,
as the siglock won't change during exec and release_task
will use the same siglock for the old threads as for
the new threads.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c       | 22 ++--------------------
 kernel/exit.c   | 18 ++++++++----------
 kernel/signal.c |  2 +-
 3 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 65145a3df065..303a114b00ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1071,9 +1071,6 @@ static int de_thread(struct task_struct *tsk)
 
 	sig->group_exit_task = tsk;
 	sig->notify_count = zap_other_threads(tsk);
-	if (!thread_group_leader(tsk))
-		sig->notify_count--;
-
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1092,23 +1089,8 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			cgroup_threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			cgroup_threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
+		cgroup_threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 
 		/*
 		 * The only record we have of the real-time age of a
diff --git a/kernel/exit.c b/kernel/exit.c
index 8c5b3e106298..955c96e3fc12 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -118,13 +118,6 @@ static void __exit_signal(struct task_struct *tsk)
 		tty = sig->tty;
 		sig->tty = NULL;
 	} else {
-		/*
-		 * If there is any task waiting for the group exit
-		 * then notify it:
-		 */
-		if (sig->notify_count > 0 && !--sig->notify_count)
-			wake_up_process(sig->group_exit_task);
-
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
 	}
@@ -712,6 +705,8 @@ static void forget_original_parent(struct task_struct *father,
  */
 static void exit_notify(struct task_struct *tsk, int group_dead)
 {
+	struct sighand_struct *sighand = tsk->sighand;
+	struct signal_struct *signal = tsk->signal;
 	bool autoreap;
 	struct task_struct *p, *n;
 	LIST_HEAD(dead);
@@ -739,9 +734,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
-		wake_up_process(tsk->signal->group_exit_task);
+	spin_lock(&sighand->siglock);
+	/* mt-exec, de_thread is waiting for threads to exit */
+	if (signal->notify_count > 0 && !--signal->notify_count)
+		wake_up_process(signal->group_exit_task);
+
+	spin_unlock(&sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 11fa736eb2ae..fd75ba33ee3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1205,13 +1205,13 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec
@ 2017-04-02 22:57                         ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-02 22:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api


Add exec_id to signal_struct and compare it at a few choice moments.

I believe this closes the security holes opened up by letting the
zombie threads linger after exec.

The problem is that the old threads may have different creds after a setuid
exec, and formerly shared resources may then change.  So signal sending
and accesses via proc need to be blocked.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 1 +
 include/linux/sched/signal.h | 1 +
 kernel/fork.c                | 1 +
 kernel/ptrace.c              | 4 ++++
 kernel/signal.c              | 7 ++++++-
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 303a114b00ce..730dee8bb2f8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1323,6 +1323,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
 	current->self_exec_id++;
+	current->signal->exec_id = current->self_exec_id;
 	flush_signal_handlers(current, 0);
 }
 EXPORT_SYMBOL(setup_new_exec);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2cf446704cd4..63ae951ee330 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -80,6 +80,7 @@ struct signal_struct {
 	atomic_t		live;
 	int			nr_threads;
 	struct list_head	thread_head;
+	u32			exec_id;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0632ac1180be..a442fa099842 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1387,6 +1387,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
+	sig->exec_id = current->self_exec_id;
 
 	mutex_init(&sig->cred_guard_mutex);
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 0af928712174..cc6b10b1ffbe 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -277,6 +277,10 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	 * or halting the specified task is impossible.
 	 */
 
+	/* Don't allow inspecting a thread after exec */
+	if (task->self_exec_id != task->signal->exec_id)
+		return 1;
+
 	/* Don't let security modules deny introspection */
 	if (same_thread_group(task, current))
 		return 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index fd75ba33ee3d..fe8dcdb622f5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			from_ancestor_ns || (info == SEND_SIG_FORCED)))
 		goto ret;
 
+	/* Don't allow thread group signals after exec */
+	if (group && (t->signal->exec_id != t->self_exec_id))
+		goto ret;
+
 	pending = group ? &t->signal->shared_pending : &t->pending;
 	/*
 	 * Short-circuit ignored signals and support queuing
@@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 		 * must see ->sighand == NULL.
 		 */
 		spin_lock(&sighand->siglock);
-		if (likely(sighand == tsk->sighand)) {
+		if (likely((sighand == tsk->sighand) &&
+			   (tsk->self_exec_id == tsk->signal->exec_id))) {
 			rcu_read_unlock();
 			break;
 		}
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
@ 2017-04-03 18:12                             ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-03 18:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/02, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or
> > exit_notify() should set tsk->exit_state under siglock, otherwise zap() can
> > return the wrong count.
>
> zap_other_thread(tsk, 0) only gets called in the case where we don't
> care about the return value.  It does not get called from fs/exec.c

I meant that may_hang == 0 implies zap_other_threads(do_count => -1), which should
return the number of threads which didn't pass exit_notify(). The returned value
can be wrong unless you change exit_notify() to set exit_state under siglock.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-04-03 18:37                         ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-03 18:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Eric,

I see another series from you, but I simply failed to force myself to read
it carefully, because at first glance it makes me really sad; I dislike
it even if it is correct. Yes, yes, sure, I can be wrong. Will try tomorrow.

On 04/02, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Anyway, Eric, even if we can and want to do this, why we can't do this on
> > top of my fix?
>
> Because your reduction in scope of cred_guard_mutex is fundamentally
> broken and unnecessary.

And you never explained why it is wrong, or I failed to understand you.

> > I simply fail to understand why you dislike it that much. Yes it is not
> > pretty, I said this many times, but it is safe in that it doesn't really
> > change the current behaviour.
>
> No it is not safe.  And it promotes wrong thinking which is even more
> dangerous.

So please explain why it is not safe and why it is dangerous.

Just in case, if you mean flush_signal_handlers() outside of cred_guard_mutex,
please explain what I have missed in case you still think this is wrong.

> I reviewed the code and cred_guard_mutex needs to cover what it covers.

I strongly, strongly disagree. Its scope is unnecessarily huge, and we should
narrow it in any case, even if the current code were not buggy. But this is
almost offtopic, let's discuss it separately.

> > I am much more worried about 2/2 you didn't argue with, this patch _can_
> > break something and this is obviously not good even if PTRACE_EVENT_EXIT
> > was always broken.
>
> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
> know what the implications of changing it are.  Let's see...

And nobody knows ;) This is the problem: even a clear ptrace bugfix can
break something. This has happened before, and we had to revert obviously-
correct patches; the bug was already being used as a feature.

> If delivering a second SIGKILL
...
> So userspace can absolutely kill a processes in PTRACE_EVENT_EXIT
> before the tracers find it.
>
> Therefore we are only talking a quality of implementation issue
> if we actually stop and wait for the tracer or not.

Oh, this is another story that needs another discussion. We really need some
changes in this area: we need to distinguish a SIGKILL sent from user space
from (say) one generated by group-exit, and we need to decide when we should
stop.

But at least I think the tracee should never stop if the SIGKILL comes from
user space. And yes, ptrace_stop() is ugly and wrong; just look at the
arch_ptrace_stop_needed() check. The problem, again, is that any fix will
be user-visible.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
  2017-04-03 18:12                             ` Oleg Nesterov
  (?)
@ 2017-04-03 21:04                             ` Eric W. Biederman
  2017-04-05 16:44                               ` Oleg Nesterov
  -1 siblings, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-03 21:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/02, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or
>> > exit_notify() should set tsk->exit_state under siglock, otherwise zap() can
>> > return the wrong count.
>>
>> zap_other_threads(tsk, 0) only gets called in the case where we don't
>> care about the return value.  It does not get called from fs/exec.c
>
> I meant that may_hang == 0 implies zap_other_threads(do_count => -1) which should
> return the number of threads which didn't pass exit_notify(). The returned value
> can be wrong unless you change exit_notify() to set exit_state under
> siglock.

Interesting, an existing bug.  I won't deny that one.

Subtle to catch, but easy enough to fix.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-04-03 22:49                           ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-03 22:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> Eric,
>
> I see another series from you, but I simply failed to force myself to read
> it carefully. Because at first glance it makes me really sad, I do dislike
> it even if it is correct. Yes, yes, sure, I can be wrong. Will try
> tomorrow.

Yes.  I needed to get my thoughts concrete.  I missed fixing the race in
zap_other_threads.  But overall I think things are moving in a good
direction.

>>
>> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
>> know what the implications of changing it are.  Let's see...
>
> And nobody knows ;) This is the problem: even a clear ptrace bugfix can
> break something. This has happened before and we had to revert obviously-
> correct patches; the bug was already being used as a feature.

Yes, that is the challenge of changing userspace, which is why it helps
to test as much of a userspace change as possible.  Or to get very
clever and figure out how to avoid the userspace change.

So I think it is worth knowing that lldb actually uses
PTRACE_O_TRACEEXIT.  So we can test at least some programs to verify
that all is well.

I don't see any way around cleaning up PTRACE_O_TRACEEXIT.  As
we fundamentally have the non-thread-group-leader exec problem.
We have to reap that previous leader thread with release_task.
Which means we can't stop for a PTRACE_O_TRACEEXIT.


>> If delivering a second SIGKILL
> ...
>> So userspace can absolutely kill a process in PTRACE_EVENT_EXIT
>> before the tracers find it.
>>
>> Therefore we are only talking about a quality-of-implementation issue
>> if we actually stop and wait for the tracer or not.
>
> Oh, this is another story which needs another discussion. We really need some
> changes in this area: we need to distinguish a SIGKILL sent from user-space
> from one sent by (say) group-exit, and we need to decide when we should stop.
>
> But at least I think the tracee should never stop if SIGKILL comes from
> user space. And yes ptrace_stop() is ugly and wrong, just look at the
> arch_ptrace_stop_needed() check. The problem, again, is that any fix will
> be user-visible.

The only issue I see is that arch_ptrace_stop() may sleep (sparc and
ia64 do, as they flush the register stack to memory).  Since the
code may sleep, we can't set TASK_TRACED until after calling
arch_ptrace_stop().

My inclination is to just solve that by saying:
if (!sigkill_pending(current))
	set_current_state(TASK_TRACED);

That removes the special case.  We have to handle SIGKILL being
delivered immediately after set_current_state in any event.  And as we
are talking about something that happens on rare architectures I don't
see any problem with tweaking that code at all.

It is closely enough related that I will fold it into the next version
of my patch.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* scope of cred_guard_mutex.
@ 2017-04-03 22:49                           ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-03 22:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/02, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > Anyway, Eric, even if we can and want to do this, why can't we do this on
>> > top of my fix?
>>
>> Because your reduction in scope of cred_guard_mutex is fundamentally
>> broken and unnecessary.
>
> And you never explained why it is wrong, or I failed to understand you.
>
>> > I simply fail to understand why you dislike it that much. Yes it is not
>> > pretty, I said this many times, but it is safe in that it doesn't really
>> > change the current behaviour.
>>
>> No it is not safe.  And it promotes wrong thinking which is even more
>> dangerous.
>
> So please explain why it is not safe and why it is dangerous.
>
> Just in case: if you mean flush_signal_handlers() outside of cred_guard_mutex,
> please explain what I have missed, assuming you still think this is wrong.

>> I reviewed the code and cred_guard_mutex needs to cover what it covers.
>
> I strongly, strongly disagree. Its scope is unnecessarily huge, we should narrow
> it in any case, even if the current code were not buggy. But this is almost
> off-topic, let's discuss this separately.

You have asked why I have problems with your patch and so I am going to
try to explain.  Partly I want to see a clean set of patches that we
can merge into Linus's tree before we make any compromises, because the
work of preparing a clean patchset may inform us of something better.  Plus
we need to make something clean and long-term maintainable in any event.

Partly I object because your understanding and my understanding of
cred_guard_mutex are very different.

As I read and understand the code, the job of cred_guard_mutex is to keep
ptrace (and other threads of the process) from interfering in
exec, and to ensure old resources are accessed with permission checks
using our original credentials and that new and modified resources are
accessed with permission checks using our new credentials.


I object to your patch in particular because you deliberately mess up
the part of only making old resources available with old creds and
new resources available with new creds.  Even if the current permission
checks are a don't-care, it still remains conceptually wrong.  And
conceptually wrong code tends towards maintenance problems
and real surprises when someone makes small changes to the code.  Which
is what I mean when I say your patch is dangerous.


AKA what I see needing to be protected looks something like:
	mutex_lock();
        new_cred = compute_new_cred(tsk);
        new_mm = compute_new_mm(tsk);
        tsk->mm = new_mm;
        tsk->cred = new_cred;
        zap_other_threads(tsk);
        update_sighand(tsk);
        update_signal(tsk);
        do_close_on_exec();
        update_tsk_fields(tsk);
        mutex_unlock();

The only way I can see of reducing the scope of cred_guard_mutex is
performing work in such a way that ptrace and the other threads can't
interfere and then taking the lock.  Computing the new mm and the new
credentials are certainly candidates for that kind of treatment.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
  2017-04-03 22:49                           ` Eric W. Biederman
  (?)
@ 2017-04-05 16:08                           ` Oleg Nesterov
  2017-04-05 16:11                             ` Kees Cook
  2017-04-05 17:53                             ` Eric W. Biederman
  -1 siblings, 2 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/03, Eric W. Biederman wrote:
>
> You have asked why I have problems with your patch and so I am going to
> try to explain.  Partly I want to see a clean set of patches that we
> can merge into Linus's tree before we make any compromises, because the
> work of preparing a clean patchset may inform us of something better.  Plus
> we need to make something clean and long-term maintainable in any event.
>
> Partly I object because your understanding and my understanding of
> cred_guard_mutex are very different.

And I think there is another problem: your understanding and my understanding
of "clean" differ too much, and it seems that we cannot convince each other ;)

The last series looks buggy (I'll send more emails later today), but the
main problem is that - in my opinion! - your approach is "obviously wrong
and much less clean". But yes, yes, I understand that this is my opinion,
and I can be wrong.

Eric, I think we need more CC's. Linus, probably security list, the more
the better.

I am going to resend my series with more CC's, then you can nack it and
explain what you think we should do. Perhaps someone else will suggest
a better solution, or at least review the patches. OK?

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
  2017-04-05 16:08                           ` Oleg Nesterov
@ 2017-04-05 16:11                             ` Kees Cook
  2017-04-05 17:53                             ` Eric W. Biederman
  1 sibling, 0 replies; 93+ messages in thread
From: Kees Cook @ 2017-04-05 16:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Jann Horn, Michal Hocko, Ulrich Obergfell, LKML,
	Linux API

On Wed, Apr 5, 2017 at 9:08 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 04/03, Eric W. Biederman wrote:
>>
>> You have asked why I have problems with your patch and so I am going to
>> try to explain.  Partly I want to see a clean set of patches that we
>> can merge into Linus's tree before we make any compromises, because the
>> work of preparing a clean patchset may inform us of something better.  Plus
>> we need to make something clean and long-term maintainable in any event.
>>
>> Partly I object because your understanding and my understanding of
>> cred_guard_mutex are very different.
>
> And I think there is another problem, your understanding and my understanding
> of "clean" differ too much and it seems that we can not convince each other ;)
>
> The last series looks buggy (I'll send more emails later today), but the
> main problem is that - in my opinion! - your approach is "obviously wrong
> and much less clean". But yes, yes, I understand that this is my opinion,
> and I can be wrong.
>
> Eric, I think we need more CC's. Linus, probably security list, the more
> the better.
>
> I am going to resend my series with more CC's, then you can nack it and
> explain what you think we should do. Perhaps someone else will suggest
> a better solution, or at least review the patches. OK?

I've been following along, but it seems like there are a lot of edge
cases in these changes. I'll try to meaningfully comment on the coming
emails... having code examples of why various things will/won't work
goes a long way toward understanding what's safe or not...

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 4/5] exec: If possible don't wait for ptraced threads to be reaped
  2017-04-02 22:53                         ` Eric W. Biederman
  (?)
@ 2017-04-05 16:15                         ` Oleg Nesterov
  -1 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/02, Eric W. Biederman wrote:
>
> Take advantage of the situation when sighand->count == 1 to only wait
> for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD in de_thread.

Let me comment on this patch first, it looks mostly fine to me.

And note that this is what my patch does too: exec() waits until
all threads pass exit_notify() and drops cred_guard_mutex.

However, with my patch exec() then waits until all threads
disappear. This uglifies the code, but it is simple and safe.

With your patches exec doesn't do another wait and succeeds after
the 1st wait. I think this is wrong and the next patch is not enough.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec
@ 2017-04-05 16:18                           ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/02, Eric W. Biederman wrote:
>
> Add exec_id to signal_struct and compare it at a few choice moments.

I really dislike this change no matter what, sorry.

Firstly, task_struct->*_exec_id should simply die (I already have the
patch), or at least they should be moved into signal_struct, simply
because this is a per-process thing.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
>  			from_ancestor_ns || (info == SEND_SIG_FORCED)))
>  		goto ret;
>
> +	/* Don't allow thread group signals after exec */
> +	if (group && (t->signal->exec_id != t->self_exec_id))
> +		goto ret;

Hmm. Either we do not need this exec_id check at all, or we should not
take "group" into account; a fatal signal (say SIGKILL) will kill the
whole thread-group.

> @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
>  		 * must see ->sighand == NULL.
>  		 */
>  		spin_lock(&sighand->siglock);
> -		if (likely(sighand == tsk->sighand)) {
> +		if (likely((sighand == tsk->sighand) &&
> +			   (tsk->self_exec_id == tsk->signal->exec_id))) {

Oh, this doesn't look good to me. Yes, with your approach we probably need
this to, say, ensure that posix-cpu-timer can't kill the process after exec,
but I'd rather add the exit_state check into run_posix_timers().

But OK, suppose that we fix the problems with signal-after-exec.

====================================================================
Now let's fix another problem. A mt exec succeeds and the application does
sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC), which fails because it finds
another (zombie) SECCOMP_MODE_FILTER thread.

And after we fix this problem, what else will we need to fix?


I really think that - whatever we do - there should be no other threads
after exec, even zombies.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump
  2017-04-02 22:51                         ` Eric W. Biederman
  (?)
@ 2017-04-05 16:19                         ` Oleg Nesterov
  -1 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/02, Eric W. Biederman wrote:
>
> In the case of exec and coredump which have many interesting deadlock
> opportunities

So this patch is very close to my 2/2 one-liner, except

	- you removed the current->mm == current->parent->mm check

	  I didn't do this on purpose, because even the ->core_state
	  is not really needed if we check ->group_exit_task; this
	  needs more changes anyway, but I won't argue.

	- With your patch we send the notification to the debugger even
	  if we are not going to stop.

	  This is not wrong, but why? This is pointless; nobody relies
	  on SIGCHLD, and if nothing else it doesn't queue.

	  Again, I won't argue, but this complicates both the patch
	  and the code for no reason. Unless I missed something.

> Keep sending the signal to the tracer so that this appears like
> the worst case where someone else sent the process a SIGKILL before
> the tracer could react.  So all non-buggy tracers must support
> this case.

Well, I can't understand the changelog. Sure, debugger must support
this case, but obviously this can break things anyway.

For example. The coredumping thread must stop in PTRACE_EVENT_EXIT.
There is a tool (I don't remember its name) which does
ptrace_attach(PTRACE_SEIZE, PTRACE_O_TRACEEXIT) after the coredump
was already started, closes the pipe, and reads the registers when
this thread actually exits.

This patch or my 2/2 should not break it, ->group_exit_task will be
cleared after do_coredump(), but unfortunately something else can be
broken.

So I think the changelog should mention that yes, this is a user-visible
change which _can_ break something anyway.

In short, I will be really happy if this patch comes from you, not me ;)

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 3/5] clone: Disallown CLONE_THREAD with a shared sighand_struct
@ 2017-04-05 16:24                           ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/02, Eric W. Biederman wrote:
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process(
>  	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
>  		return ERR_PTR(-EINVAL);
>
> +	/* Disallow CLONE_THREAD with a shared SIGHAND structure.  No
> +	 * one cares

Well, I can't resist... I won't argue, but we can't know whether anyone cares
or not. I agree that most probably this won't break anything, but who
knows... I am always scared when we add incompatible changes.

> and supporting it leads to unnecessarily complex
> +	 * code.
> +	 */
> +	if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1))
> +		return ERR_PTR(-EINVAL);

Perhaps the comment should explain why we do this and say that
sighand-unsharing in de_thread() depends on this.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped
  2017-04-03 21:04                             ` Eric W. Biederman
@ 2017-04-05 16:44                               ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 16:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > I meant that may_hang == 0 implies zap_other_threads(do_count => -1) which should
> > return the number of threads which didn't pass exit_notify(). The returned value
> > can be wrong unless you change exit_notify() to set exit_state under
> > siglock.

but I forgot to add that, of course, this problem is very minor because
we can only miss a thread which is already at the end of exit_notify(),
so nothing bad can happen.

But IMO it should be fixed anyway, simply because this looks wrong/racy.
Your recent 4/5 has the same problem.

> Interesting, an existing bug.

Hmm... what do you mean? The current code looks fine.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 3/5] clone: Disallown CLONE_THREAD with a shared sighand_struct
  2017-04-05 16:24                           ` Oleg Nesterov
  (?)
@ 2017-04-05 17:34                           ` Eric W. Biederman
  2017-04-05 18:11                             ` Oleg Nesterov
  -1 siblings, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-05 17:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/02, Eric W. Biederman wrote:
>>
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process(
>>  	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
>>  		return ERR_PTR(-EINVAL);
>>
>> +	/* Disallow CLONE_THREAD with a shared SIGHAND structure.  No
>> +	 * one cares
>
> Well, I can't resist... I won't argue, but we can't know whether anyone cares
> or not. I agree that most probably this won't break anything, but who
> knows... I am always scared when we add incompatible changes.

I agree that changing userspace semantics is something to be very
careful with.  But at least for purposes of discussion I think this is a
good patch.

I can avoid this change but it requires moving sighand->siglock
into signal_struct and introducing a new spinlock into sighand_struct
to just guard the signal handlers.

However I think the change to move siglock would be a distraction from
the larger issues of this patchset.

Once we address the core issues I will be happy to revisit this.

>> and supporting it leads to unnecessarily complex
>> +	 * code.
>> +	 */
>> +	if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1))
>> +		return ERR_PTR(-EINVAL);
>
> Perhaps the comment should explain why we do this and say that
> sighand-unsharing in de_thread() depends on this.

That would be a better comment.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
  2017-04-05 16:08                           ` Oleg Nesterov
  2017-04-05 16:11                             ` Kees Cook
@ 2017-04-05 17:53                             ` Eric W. Biederman
  2017-04-05 18:15                               ` Oleg Nesterov
  1 sibling, 1 reply; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-05 17:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/03, Eric W. Biederman wrote:
>>
>> You have asked why I have problems with your patch and so I am going to
>> try to explain.  Partly I want to see a clean set of patches that we
>> can merge into Linus's tree before we make any compromises.  Because the
>> work preparing a clean patchset may inform us of something better.  Plus
>> we need to make something clean and long term maintainable in any event.
>>
>> Partly I object because your understanding and my understanding of
>> cred_guard_mutex are very different.
>
And I think there is another problem: your understanding and my understanding
of "clean" differ too much, and it seems that we cannot convince each other ;)

We have barely begun.  You have not shown anyone what your idea of a
clean fix actually is.  All I have seen from you is a quick hack that is
back-portable.

Focusing on the back port is the wrong order to solve the issue in.  We
need to solve this in an upstream mergable and maintainable way and then
we can worry about backports.

From a userspace perspective, I find anything in the kernel blocking on
a zombie to be just wrong.  A zombie is dead.  Waiting for a zombie should
in all cases be optional.  The system should not break if we don't reap
a zombie.

You have made a clear case that the zombies need to exist for strace -f
to wait on.  So since the zombies must exist we should make them follow
normal zombie rules.

With your change exec still blocks waiting for zombies.

Furthermore you have to violate the reasonable rule that:
* pre-exec resources are guarded with the pre-exec process cred.
* post-exec resources are guarded with the post-exec process cred.

So from a userspace perspective I think your semantics are absolutely
insane.

We also need clean code to implement all of this (which I am still
inching towards).  But we need to be implementing something that makes
sense from a userspace perspective.


> The last series looks buggy (I'll send more emails later today), but the
> main problem is that - in my opinion! - your approach is "obviously wrong
> and much less clean". But yes, yes, I understand that this is my opinion,
> and I can be wrong.

How is fixing the implementation so that we don't block waiting
for zombies to be reaped wrong?

> Eric, I think we need more CC's. Linus, probably security list, the more
> the better.
>
> I am going to resend my series with more CC's, then you can nack it and
> explain what you think we should do. Perhaps someone else will suggest
> a better solution, or at least review the patches. OK?

I will be happy to look, but my primary objections to your patch were:

- You implemented a hack for backporting rather than fixing things
  cleanly the first time.

- You made comments about cred_guard_mutex and its scope that, when I
  reviewed the code, were false.  cred_guard_mutex, although probably the
  wrong locking structure, is semantically and fundamentally where it
  needs to be.  We can optimize it, but we can't change what is protected
  to make our lives easier.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct
  2017-04-05 17:34                           ` Eric W. Biederman
@ 2017-04-05 18:11                             ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 18:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/05, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> I agree that changing userspace semantics is something to be very
> careful with.  But at least for purposes of discussion I think this is a
> good patch.

I agree that we need it with your approach,

but imo it would be much better to not depend on subtle changes
like this. My 2/2 or your 1/5 are already bad enough.

> I can avoid this change but it requires moving sighand->siglock
> into signal_struct and introducing a new spinlock into sighand_struct
> to just guard the signal handlers.

Oh, this looks much, much worse to me.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
  2017-04-05 17:53                             ` Eric W. Biederman
@ 2017-04-05 18:15                               ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-05 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

On 04/05, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> - You made comments about cred_guard_mutex and its scope that, when I
>   reviewed the code, were false.

Too late for me. I'll try to read other emails from you and reply tomorrow.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec
@ 2017-04-05 18:16                             ` Eric W. Biederman
  0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-05 18:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/02, Eric W. Biederman wrote:
>>
>> Add exec_id to signal_struct and compare it at a few choice moments.
>
> I really dislike this change no matter what, sorry.
>
> Firstly, task_struct->*_exec_id should simply die (I already have the
> patch), or at least they should be moved into signal_struct simply
> because this is per-process thing.

I am quite happy to find a better way to implement this.  More than
anything this was my proof of concept that it is possible to close
the security holes created if we allow our zombies to be normal zombies.


>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
>>  			from_ancestor_ns || (info == SEND_SIG_FORCED)))
>>  		goto ret;
>>
>> +	/* Don't allow thread group signals after exec */
>> +	if (group && (t->signal->exec_id != t->self_exec_id))
>> +		goto ret;
>
> Hmm. Either we do not need this exec_id check at all, or we should not
> take "group" into account; a fatal signal (say SIGKILL) will kill the
> whole thread-group.

Wow.  Those are crazy semantics for fatal signals.  Sending a tkill
should not affect the entire thread group.  Oleg, I think this is a bug
you introduced and likely requires a separate fix.

I really don't understand the logic in:

commit 5fcd835bf8c2cde06404559b1904e2f1dfcb4567
Author: Oleg Nesterov <oleg@tv-sign.ru>
Date:   Wed Apr 30 00:52:55 2008 -0700

    signals: use __group_complete_signal() for the specific signals too
    
    Based on Pavel Emelyanov's suggestion.
    
    Rename __group_complete_signal() to complete_signal() and use it to process
    the specific signals too.  To do this we simply add the "int group" argument.
    
    This allows us to greatly simply the signal-sending code and adds a useful
    behaviour change.  We can avoid the unneeded wakeups for the private signals
    because wants_signal() is more clever than sigismember(blocked), but more
    importantly we now take into account the fatal specific signals too.
    
    The latter allows us to kill some subtle checks in handle_stop_signal() and
    makes the specific/group signal's behaviour more consistent.  For example,
    currently sigtimedwait(FATAL_SIGNAL) behaves differently depending on was the
    signal sent by kill() or tkill() if the signal was not blocked.
    
    And.  This allows us to tweak/fix the behaviour when the specific signal is
    sent to the dying/dead ->group_leader.
    
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
    Cc: Roland McGrath <roland@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

>> @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
>>  		 * must see ->sighand == NULL.
>>  		 */
>>  		spin_lock(&sighand->siglock);
>> -		if (likely(sighand == tsk->sighand)) {
>> +		if (likely((sighand == tsk->sighand) &&
>> +			   (tsk->self_exec_id == tsk->signal->exec_id))) {
>
> Oh, this doesn't look good to me. Yes, with your approach we probably need
> this to, say, ensure that posix-cpu-timer can't kill the process after exec,
> but I'd rather add the exit_state check into run_posix_timers().

The entire point of lock_task_sighand is to not operate on
tasks/processes that have exited. The fact that sighand is even in the
name is deceptive, because it is all about siglock and has nothing to do
with sighand.

> But OK, suppose that we fix the problems with signal-after-exec.
>
> ====================================================================
> Now let's fix another problem. An mt exec succeeds and the application does
> sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC) which fails because it finds
> another (zombie) SECCOMP_MODE_FILTER thread.
>
> And after we fix this problem, what else will we need to fix?
>
>
> I really think that - whatever we do - there should be no other threads
> after exec, even zombies.

I see where you are coming from.

I need to stare at this a bit longer.  Because you are right.  Reusing
the signal_struct and leaving zombies around is very prone to bugs.  So
it is not very maintainable.

I suspect the answer here is to simply allocate a new sighand_struct and
a new signal_struct if we are not single-threaded by the time we get
down to the end of de_thread().

However, even if it is a case of whack-a-mole, semantically not blocking
for zombies looks like the right thing to do, and we need to figure out
how to do it maintainably.

Eric

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec
@ 2017-04-06 15:48                               ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-06 15:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api, Eugene Syromiatnikov

On 04/05, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> --- a/kernel/signal.c
> >> +++ b/kernel/signal.c
> >> @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
> >>  			from_ancestor_ns || (info == SEND_SIG_FORCED)))
> >>  		goto ret;
> >>
> >> +	/* Don't allow thread group signals after exec */
> >> +	if (group && (t->signal->exec_id != t->self_exec_id))
> >> +		goto ret;
> >
> > Hmm. Either we do not need this exec_id check at all, or we should not
> > take "group" into account; a fatal signal (say SIGKILL) will kill the
> > whole thread-group.
>
> Wow.  Those are crazy semantics for fatal signals.  Sending a tkill
> should not affect the entire thread group.

How so? SIGKILL or any fatal signal should kill the whole process, even if
it was sent by tkill().

> Oleg I think this is a bug
> you introduced and likely requires a separate fix.
>
> I really don't understand the logic in:
>
> commit 5fcd835bf8c2cde06404559b1904e2f1dfcb4567
> Author: Oleg Nesterov <oleg@tv-sign.ru>
> Date:   Wed Apr 30 00:52:55 2008 -0700
>
>     signals: use __group_complete_signal() for the specific signals too

No. You can even forget about the "send" path for the moment. Just suppose that
a thread dequeues SIGKILL sent by tkill(). In this case it will call
do_group_exit() and kill the group anyway. It is not possible to kill an
individual thread, and Linux never did this.

Afaics, this commit also fixes the case where SIGKILL can be lost when tkill()
races with the exiting target. Or if the target is a zombie-leader. Exactly
because they obviously can't dequeue SIGKILL.

Plus we want to shut down the whole thread-group "asap", which is why
complete_signal() sets SIGNAL_GROUP_EXIT and sends SIGKILL to other threads
in the "send" path.

This btw reminds me that we want to do the same with sig_kernel_coredump()
signals too, but this is not simple.

> >> @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
> >>  		 * must see ->sighand == NULL.
> >>  		 */
> >>  		spin_lock(&sighand->siglock);
> >> -		if (likely(sighand == tsk->sighand)) {
> >> +		if (likely((sighand == tsk->sighand) &&
> >> +			   (tsk->self_exec_id == tsk->signal->exec_id))) {
> >
> > Oh, this doesn't look good to me. Yes, with your approach we probably need
> > this to, say, ensure that posix-cpu-timer can't kill the process after exec,
> > but I'd rather add the exit_state check into run_posix_timers().
>
> The entire point of lock_task_sighand is to not operate on
> tasks/processes that have exited.

Well, the entire point of lock_task_sighand() is to take ->siglock if possible.

> The fact that sighand is even in the name is deceptive, because it is
> all about siglock and has nothing to do with sighand.

Not sure I understand what you mean...

Yes, lock_task_sighand() can obviously fail, and yes the failure is used
as an indication that this thread has gone. But a zombie thread controlled
by the parent/debugger has not gone yet.

> > ====================================================================
> > Now let's fix another problem. An mt exec succeeds and the application does
> > sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC) which fails because it finds
> > another (zombie) SECCOMP_MODE_FILTER thread.
> >
> > And after we fix this problem, what else will we need to fix?
> >
> >
> > I really think that - whatever we do - there should be no other threads
> > after exec, even zombies.
>
> I see where you are coming from.
>
> I need to stare at this a bit longer.  Because you are right.  Reusing
> the signal_struct and leaving zombies around is very prone to bugs.  So
> it is not very maintainable.

Yes, yes, yes. This is what I was arguing about.

> I suspect the answer here is to simply allocate a new sighand_struct and
> a new signal_struct if we are not single-threaded by the time we get
> down to the end of de_thread().

Maybe. Not sure. Looks very nontrivial.

And I still think that if we do this, we should fix the bug first, then try
to do something like this.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
@ 2017-04-06 15:55                             ` Oleg Nesterov
  0 siblings, 0 replies; 93+ messages in thread
From: Oleg Nesterov @ 2017-04-06 15:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell,
	linux-kernel, linux-api, Eugene Syromiatnikov

On 04/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> I reviewed the code and cred_guard_mutex needs to cover what it covers.
> >
> > I strongly, strongly disagree. Its scope is unnecessarily huge, we should narrow
> > it in any case, even if the current code was not buggy. But this is almost
> > offtopic, let's discuss this separately.

And let me repeat/clarify.

I meant, I see absolutely no reason to do copy_strings() with this mutex held,
for example. And note that copy_strings() can use a lot of memory/time; it can
trigger oom, swapping, etc. I also think that we can probably do
check_unsafe_exec(), which in particular sets the LSM_UNSAFE_* bits, a bit
later, but I am less sure about this and it needs more work.

And perhaps more changes like this to narrow the scope of this mutex. And I
thought you already agreed with this?
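To make the point concrete, the flow in question looks roughly like this
(an editorial sketch in simplified pseudocode; function names are from
fs/exec.c of this era, with details and error paths elided):

```
/* simplified pseudocode of do_execveat_common(), fs/exec.c */
do_execveat_common()
	prepare_bprm_creds()	// mutex_lock_interruptible(&cred_guard_mutex)
	check_unsafe_exec()	// sets bprm->unsafe (LSM_UNSAFE_* bits)
	prepare_binprm()
	copy_strings(envp)	// can use lots of memory/time, trigger
	copy_strings(argv)	//   oom/swapping -- all under the mutex
	exec_binprm()		// de_thread(); install_exec_creds()
				//   finally drops cred_guard_mutex
```

The objection above is that the copy_strings() calls sit inside the
mutex-held region even though they do not obviously need to.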


> You have asked why I have problems with your patch and so I am going to
> try to explain.  Partly I want to see a clean set of patches that we
> can merge into Linus's tree before we make any compromises.

Sure, me too. I do not see a simple and clean fix; your attempts were wrong
so far and imo would be worse even if they were correct.

And this makes me think again that we need to restart this discussion with
more CC's.

> Partly I object because your understanding and my understanding of
> cred_guard_mutex are very different.
>
> As I read and understand the code the job of cred_guard_mutex is to keep
> ptrace (and other threads of the proccess) from interferring in
> exec and to ensure old resources are accessed with permission checks
> using our original credentials and that new and modified resources are
> accessed with permission checks using our new credentials.

Yes, this is clear.

> I object to your patch in particular because you deliberately mess up
> the part of only making old resources available with old creds and
> new resources available with new creds.

Could you spell this out, please? I don't understand.

> AKA What I see needing to be protected looks something like:
> 	mutex_lock();
>         new_cred = compute_new_cred(tsk);
>         new_mm = compute_new_mm(tsk);
>         tsk->mm = new_mm;
>         tsk->cred = new_cred;
>         zap_other_threads(tsk);
>         update_sighand(tsk);
>         update_signal(tsk);
>         do_close_on_exec();
>         update_tsk_fields(tsk);
>         mutex_unlock();
>
> The only way I can see of reducing the scope of cred_guard_mutex is
> performing work in such a way that ptrace and the other threads can't
> interfere and then taking the lock.  Computing the new mm and the new
> credentials are certainly candidates for that kind of treatment.

OK. And yes, I am not sure this all is optimal, but didn't I say from
the very beginning that it is unlikely we can change this?

Now let me quote your next email:

On 04/05, Eric W. Biederman wrote:
>
> We have barely begun.  You have not shown anyone what your idea of a
> clean fix actually is.  All I have seen from you is a quick hack that
> is back-portable.

Yes. I really tried to make a back-portable fix. Otherwise I would start
with ->notify_count cleanups.

> With your change exec still blocks waiting for zombies.

And this is what we currently do. Whether we should do this may be debatable,
but why do you blame my patch?

> Furthermore you have to violate the reasonable rule that:
> * pre-exec resources are guarded with the pre-exec process cred.
> * post-exec resources are guarded with the post-exec process cred.

For example?

> So from a userspace perspective I think your semantics are absolutely
> insane.

Which semantics were changed by my patch?

> - You made comments about cred_guard_mutex and its scope that, when I
>   reviewed the code, were false.

Which of my comments was wrong?

> We can optimize it

And this is what I meant when I said "we should narrow it in any case"

> but we can't change what is protected

I am not that sure. But a) I did not even try to suggest changing anything
in this area right now, and b) I said that this is unlikely to be possible.

Oleg.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: scope of cred_guard_mutex.
@ 2017-04-07 22:07                               ` Kees Cook
  0 siblings, 0 replies; 93+ messages in thread
From: Kees Cook @ 2017-04-07 22:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Jann Horn, Michal Hocko, Ulrich Obergfell, LKML,
	Linux API, Eugene Syromiatnikov

On Thu, Apr 6, 2017 at 8:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> And this makes me think again that we need to restart this discussion with
> more CC's.

I'm a fan of that; I've not been able to follow this thread as it
seems to have gone far from the original deadlock problem. :) I've
seen issues with ptrace, zombies, and now exec. I'm lost. :P

>> Partly I object because your understanding and my understanding of
>> cred_guard_mutex are very different.
>>
>> As I read and understand the code the job of cred_guard_mutex is to keep
>> ptrace (and other threads of the process) from interfering in
>> exec and to ensure old resources are accessed with permission checks
>> using our original credentials and that new and modified resources are
>> accessed with permission checks using our new credentials.
>
> Yes, this is clear.

Maybe stupid idea: can we get a patch that just adds this kind of
documentation somewhere in the source? If we can agree on the purpose
of cred_guard_mutex, and get it into the code, that seems like a good
step in discussion...
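
A sketch of the kind of in-source comment Kees is asking for, with the
wording drawn from Eric's description quoted above. The field name
cred_guard_mutex and its home in signal_struct are real (for kernels of
this era); the reduced struct, its "_sketch" suffix, and the use of a
pthread mutex as a stand-in for the kernel's struct mutex are purely
illustrative:

```c
#include <pthread.h>  /* stand-in for the kernel's struct mutex */

/*
 * Reduced, illustrative stand-in for the kernel's signal_struct,
 * showing only the field under discussion with a candidate comment.
 */
struct signal_struct_sketch {
	/*
	 * Held across execve() and by anyone who inspects or changes
	 * this process's credentials (e.g. ptrace attach). Its job is
	 * to keep ptrace and sibling threads from interfering with
	 * exec, and to ensure that pre-exec resources are
	 * permission-checked against the old credentials while new and
	 * modified resources are checked against the new credentials.
	 */
	pthread_mutex_t cred_guard_mutex;
};
```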

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
@ 2017-09-04  3:19                         ` Robert O'Callahan
  0 siblings, 0 replies; 93+ messages in thread
From: Robert O'Callahan @ 2017-09-04  3:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Aleksa Sarai, Andy Lutomirski,
	Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko,
	Ulrich Obergfell, open list, linux-api

Sorry about replying to this old thread, but...

On Mon, Apr 3, 2017 at 9:07 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
> know what the implications of changing it are.  Let's see...
>
> gdb - no
> upstart - no
> lldb - yes
> strace - no

For the record, rr uses PTRACE_O_TRACEEXIT.

When a thread exits we need to examine its address space to find its
robust futex list and record the changes that will be performed by the
kernel as it cleans up that list. So, any failures to deliver
PTRACE_EVENT_EXIT are potentially problematic for us because we won't
get a chance to examine the address space before it disappears.

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2017-09-04  3:19 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
2017-02-13 16:12   ` kbuild test robot
2017-02-13 16:47     ` Oleg Nesterov
2017-02-13 16:39   ` kbuild test robot
2017-02-13 17:27   ` Mika Penttilä
2017-02-13 18:01     ` Oleg Nesterov
2017-02-13 18:04   ` [PATCH V2 " Oleg Nesterov
2017-02-16 11:42     ` Eric W. Biederman
2017-02-20 15:22       ` Oleg Nesterov
2017-02-20 15:36         ` Oleg Nesterov
2017-02-20 22:30         ` Eric W. Biederman
2017-02-21 17:53           ` Oleg Nesterov
2017-02-21 20:20             ` Eric W. Biederman
2017-02-22 17:41               ` Oleg Nesterov
2017-02-17  4:42     ` Eric W. Biederman
2017-02-20 15:50       ` Oleg Nesterov
2017-02-13 14:15 ` [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
2017-03-03  1:05   ` Eric W. Biederman
2017-03-03 17:33     ` Oleg Nesterov
2017-03-03 18:23       ` Eric W. Biederman
2017-03-03 18:23         ` Eric W. Biederman
2017-03-03 18:59         ` Eric W. Biederman
2017-03-03 18:59           ` Eric W. Biederman
2017-03-03 20:06           ` Eric W. Biederman
2017-03-03 20:06             ` Eric W. Biederman
2017-03-03 20:11             ` [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped Eric W. Biederman
2017-03-03 20:11               ` Eric W. Biederman
2017-03-04 17:03               ` Oleg Nesterov
2017-03-30  8:07                 ` Eric W. Biederman
2017-04-01  5:11                   ` [RFC][PATCH 0/2] exec: Fixing ptrace'd mulit-threaded hang Eric W. Biederman
2017-04-01  5:11                     ` Eric W. Biederman
2017-04-01  5:14                     ` [RFC][PATCH 1/2] sighand: Count each thread group once in sighand_struct Eric W. Biederman
2017-04-01  5:14                       ` Eric W. Biederman
2017-04-01  5:16                     ` [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped Eric W. Biederman
2017-04-01  5:16                       ` Eric W. Biederman
2017-04-02 15:35                       ` Oleg Nesterov
2017-04-02 15:35                         ` Oleg Nesterov
2017-04-02 18:53                         ` Eric W. Biederman
2017-04-02 18:53                           ` Eric W. Biederman
2017-04-03 18:12                           ` Oleg Nesterov
2017-04-03 18:12                             ` Oleg Nesterov
2017-04-03 21:04                             ` Eric W. Biederman
2017-04-05 16:44                               ` Oleg Nesterov
2017-04-02 15:38                     ` [RFC][PATCH 0/2] exec: Fixing ptrace'd mulit-threaded hang Oleg Nesterov
2017-04-02 15:38                       ` Oleg Nesterov
2017-04-02 22:50                     ` [RFC][PATCH v2 0/5] " Eric W. Biederman
2017-04-02 22:50                       ` Eric W. Biederman
2017-04-02 22:51                       ` [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump Eric W. Biederman
2017-04-02 22:51                         ` Eric W. Biederman
2017-04-05 16:19                         ` Oleg Nesterov
2017-04-02 22:51                       ` [RFC][PATCH v2 2/5] sighand: Count each thread group once in sighand_struct Eric W. Biederman
2017-04-02 22:51                         ` Eric W. Biederman
2017-04-02 22:52                       ` [RFC][PATCH v2 3/5] clone: Disallown CLONE_THREAD with a shared sighand_struct Eric W. Biederman
2017-04-02 22:52                         ` Eric W. Biederman
2017-04-05 16:24                         ` Oleg Nesterov
2017-04-05 16:24                           ` Oleg Nesterov
2017-04-05 17:34                           ` Eric W. Biederman
2017-04-05 18:11                             ` Oleg Nesterov
2017-04-02 22:53                       ` [RFC][PATCH v2 4/5] exec: If possible don't wait for ptraced threads to be reaped Eric W. Biederman
2017-04-02 22:53                         ` Eric W. Biederman
2017-04-05 16:15                         ` Oleg Nesterov
2017-04-02 22:57                       ` [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec Eric W. Biederman
2017-04-02 22:57                         ` Eric W. Biederman
2017-04-05 16:18                         ` Oleg Nesterov
2017-04-05 16:18                           ` Oleg Nesterov
2017-04-05 18:16                           ` Eric W. Biederman
2017-04-05 18:16                             ` Eric W. Biederman
2017-04-06 15:48                             ` Oleg Nesterov
2017-04-06 15:48                               ` Oleg Nesterov
2017-04-02 16:15                   ` [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped Oleg Nesterov
2017-04-02 16:15                     ` Oleg Nesterov
2017-04-02 21:07                     ` Eric W. Biederman
2017-04-02 21:07                       ` Eric W. Biederman
2017-04-03 18:37                       ` Oleg Nesterov
2017-04-03 18:37                         ` Oleg Nesterov
2017-04-03 22:49                         ` Eric W. Biederman
2017-04-03 22:49                           ` Eric W. Biederman
2017-04-03 22:49                         ` scope of cred_guard_mutex Eric W. Biederman
2017-04-03 22:49                           ` Eric W. Biederman
2017-04-05 16:08                           ` Oleg Nesterov
2017-04-05 16:11                             ` Kees Cook
2017-04-05 17:53                             ` Eric W. Biederman
2017-04-05 18:15                               ` Oleg Nesterov
2017-04-06 15:55                           ` Oleg Nesterov
2017-04-06 15:55                             ` Oleg Nesterov
2017-04-07 22:07                             ` Kees Cook
2017-04-07 22:07                               ` Kees Cook
2017-09-04  3:19                       ` [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped Robert O'Callahan
2017-09-04  3:19                         ` Robert O'Callahan
2017-03-04 16:54         ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
2017-03-04 16:54           ` Oleg Nesterov
