* [PATCH 0/2] fix the traced mt-exec deadlock @ 2017-02-13 14:14 Oleg Nesterov
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
  ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread

From: Oleg Nesterov @ 2017-02-13 14:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

Hello,

Let's finally fix this problem; it has been reported several times.

I still think that in the longer term we should (try to) rework the
security hooks and (partially) revert this change, but this is not
trivial and we need something backportable anyway.

Eric, Jann, we already discussed this change. 1/2 is the same patch I
suggested 3 months ago, except now it compiles and moves
flush_signal_handlers() to de_thread().

Both patches ask for subsequent cleanups; see the changelogs.

Oleg.

 arch/x86/ia32/ia32_aout.c |    3 ++
 fs/binfmt_aout.c          |    3 ++
 fs/binfmt_elf.c           |    6 ++-
 fs/binfmt_elf_fdpic.c     |    4 ++
 fs/binfmt_flat.c          |    3 ++
 fs/exec.c                 |  128 +++++++++++++++++++++++-----------------------
 include/linux/binfmts.h   |    1 +
 kernel/exit.c             |    5 +-
 kernel/signal.c           |   21 +++++---
 9 files changed, 101 insertions(+), 73 deletions(-)

^ permalink raw reply	[flat|nested] 93+ messages in thread
* [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
@ 2017-02-13 14:15 ` Oleg Nesterov
  2017-02-13 16:12   ` kbuild test robot
  ` (3 more replies)
  2017-02-13 14:15 ` [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
  2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov
  2 siblings, 4 replies; 93+ messages in thread

From: Oleg Nesterov @ 2017-02-13 14:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

de_thread() waits for the other threads with ->cred_guard_mutex held, and
this is really bad because the wait time is not bounded: a debugger can
delay the exit, and this lock has a lot of users (mostly abusers, imo) in
fs/proc and elsewhere. This leads to a deadlock if the debugger tries to
take the same mutex:

	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <pthread.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

This hangs because de_thread() sleeps with cred_guard_mutex held, waiting
for the debugger to release the killed thread, while the debugger sleeps
waiting for the same mutex. It is not fatal (the tracer can be killed),
but it is still a bug, and people hit it in practice.

The patch changes flush_old_exec() to use the new kill_sub_threads()
helper instead of de_thread(): it only waits until all the other threads
have passed exit_notify(). However, we still need to wait until they all
disappear; the main reason is that we can not unshare sighand until then,
because the execing thread and the zombies must use the same
sighand->siglock to serialize access to ->thread_head etc. So the patch
changes the ->load_binary() methods to call de_thread() right after
install_exec_creds() drops cred_guard_mutex.

Note that de_thread() is simplified: it no longer needs to send SIGKILL
to the sub-threads and wait for the group leader to become a zombie. But
since it is now called after flush_old_exec(), we also need to move
flush_signal_handlers() into de_thread().

TODO:

	- Move install_exec_creds() and de_thread() into setup_new_exec(),
	  unexport de_thread().

	- Avoid tasklist_lock in kill_sub_threads(); we can change
	  exit_notify() to set ->exit_state under siglock.

	- Simplify/clean up the ->notify_count logic, it is really ugly. I
	  think we should just turn this counter into
	  signal->nr_live_threads; __exit_signal() can use
	  signal->nr_threads instead.

	- In any case, we should limit the scope of cred_guard_mutex in the
	  execve paths. It is not clear why we take it at the start of
	  execve, and worse, it is not clear why we actually overload this
	  mutex to exclude other threads (except for check_unsafe_exec(),
	  but that is solvable). The original motivation was
	  signal->in_exec_mm protection, but this idea was replaced by
	  3c77f8457221 ("exec: make argv/envp memory visible to
	  oom-killer"). It is just ugly to call copy_strings() etc with
	  this mutex held.
Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Reported-by: Attila Fazekas <afazekas@redhat.com> Reported-by: Aleksa Sarai <asarai@suse.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- arch/x86/ia32/ia32_aout.c | 3 ++ fs/binfmt_aout.c | 3 ++ fs/binfmt_elf.c | 6 ++- fs/binfmt_elf_fdpic.c | 4 ++ fs/binfmt_flat.c | 3 ++ fs/exec.c | 128 +++++++++++++++++++++++----------------------- include/linux/binfmts.h | 1 + kernel/exit.c | 5 +- kernel/signal.c | 3 +- 9 files changed, 87 insertions(+), 69 deletions(-) diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c index 7c0a711..a6b9cc9 100644 --- a/arch/x86/ia32/ia32_aout.c +++ b/arch/x86/ia32/ia32_aout.c @@ -312,6 +312,9 @@ static int load_aout_binary(struct linux_binprm *bprm) return retval; install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + return retval; if (N_MAGIC(ex) == OMAGIC) { unsigned long text_addr, map_size; diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c index 2a59139..edd1335 100644 --- a/fs/binfmt_aout.c +++ b/fs/binfmt_aout.c @@ -256,6 +256,9 @@ static int load_aout_binary(struct linux_binprm * bprm) return retval; install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + return retval; if (N_MAGIC(ex) == OMAGIC) { unsigned long text_addr, map_size; diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 4223702..79508f7 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm) setup_new_exec(bprm); install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto out_free_dentry; + /* Do this so that we can load the interpreter, if need be. 
We will change some of these later */ retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), executable_stack); if (retval < 0) goto out_free_dentry; - + current->mm->start_stack = bprm->p; /* Now we do a little grungy work by mmapping the ELF image into diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index d2e36f8..75fd6d8 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm) #endif install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto error; + if (create_elf_fdpic_tables(bprm, current->mm, &exec_params, &interp_params) < 0) goto error; diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c index 9b2917a..a0ad9a3 100644 --- a/fs/binfmt_flat.c +++ b/fs/binfmt_flat.c @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm) } install_exec_creds(bprm); + res = de_thread(current); + if (res) + return res; set_binfmt(&flat_format); diff --git a/fs/exec.c b/fs/exec.c index e579466..8591c56 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm) return 0; } +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig) +{ + for (;;) { + if (unlikely(__fatal_signal_pending(tsk))) + goto killed; + set_current_state(TASK_KILLABLE); + if (!sig->notify_count) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; + +killed: + /* protects against exit_notify() and __exit_signal() */ + read_lock(&tasklist_lock); + sig->group_exit_task = NULL; + sig->notify_count = 0; + read_unlock(&tasklist_lock); + return -EINTR; +} + +/* + * Kill all the sub-threads and wait until they all pass exit_notify(). 
+ */ +static int kill_sub_threads(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + int err = -EINTR; + + if (thread_group_empty(tsk)) + return 0; + + read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); + if (!signal_group_exit(sig)) { + sig->group_exit_task = tsk; + sig->notify_count = -zap_other_threads(tsk); + err = 0; + } + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); + + if (!err) + err = wait_for_notify_count(tsk, sig); + return err; + +} + /* - * This function makes sure the current process has its own signal table, - * so that flush_signal_handlers can later reset the handlers without - * disturbing other processes. (Other processes might share the signal - * table via the CLONE_SIGHAND option to clone().) + * This function makes sure the current process has no other threads and + * has a private signal table so that flush_signal_handlers() can reset + * the handlers without disturbing other processes which might share the + * signal table via the CLONE_SIGHAND option to clone(). */ -static int de_thread(struct task_struct *tsk) +int de_thread(struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk) if (thread_group_empty(tsk)) goto no_thread_group; - /* - * Kill all other threads in the thread group. - */ spin_lock_irq(lock); - if (signal_group_exit(sig)) { - /* - * Another group action in progress, just - * return so that the signal is processed. 
- */ - spin_unlock_irq(lock); - return -EAGAIN; - } - - sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); + sig->notify_count = sig->nr_threads; if (!thread_group_leader(tsk)) sig->notify_count--; - - while (sig->notify_count) { - __set_current_state(TASK_KILLABLE); - spin_unlock_irq(lock); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - spin_lock_irq(lock); - } spin_unlock_irq(lock); + if (wait_for_notify_count(tsk, sig)) + return -EINTR; + /* * At this point all other threads have exited, all we have to - * do is to wait for the thread group leader to become inactive, - * and to assume its PID: + * do is to reap the old leader and assume its PID. */ if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } - + threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a * process, regardless of execs it's done, is start_time. 
@@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } +no_thread_group: sig->group_exit_task = NULL; sig->notify_count = 0; - -no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD; @@ -1198,15 +1210,8 @@ static int de_thread(struct task_struct *tsk) } BUG_ON(!thread_group_leader(tsk)); + flush_signal_handlers(current, 0); return 0; - -killed: - /* protects against exit_notify() and __exit_signal() */ - read_lock(&tasklist_lock); - sig->group_exit_task = NULL; - sig->notify_count = 0; - read_unlock(&tasklist_lock); - return -EAGAIN; } char *get_task_comm(char *buf, struct task_struct *tsk) @@ -1237,11 +1242,7 @@ int flush_old_exec(struct linux_binprm * bprm) { int retval; - /* - * Make sure we have a private signal table and that - * we are unassociated from the previous thread group. - */ - retval = de_thread(current); + retval = kill_sub_threads(current); if (retval) goto out; @@ -1336,7 +1337,6 @@ void setup_new_exec(struct linux_binprm * bprm) /* An exec changes our domain. 
We are no longer part of the thread group */ current->self_exec_id++; - flush_signal_handlers(current, 0); } EXPORT_SYMBOL(setup_new_exec); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 1303b57..06a5a7b 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *); extern int search_binary_handler(struct linux_binprm *); extern int flush_old_exec(struct linux_binprm * bprm); extern void setup_new_exec(struct linux_binprm * bprm); +extern int de_thread(struct task_struct *tsk); extern void would_dump(struct linux_binprm *, struct file *); extern int suid_dumpable; diff --git a/kernel/exit.c b/kernel/exit.c index 8f14b86..169d9f2 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) + /* mt-exec, kill_sub_threads() is waiting for group exit */ + if (unlikely(tsk->signal->notify_count < 0) && + !++tsk->signal->notify_count) wake_up_process(tsk->signal->group_exit_task); write_unlock_irq(&tasklist_lock); diff --git a/kernel/signal.c b/kernel/signal.c index 3603d93..b78ce63 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; - /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count; -- 2.5.0 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov @ 2017-02-13 16:12 ` kbuild test robot 2017-02-13 16:47 ` Oleg Nesterov 2017-02-13 16:39 ` kbuild test robot ` (2 subsequent siblings) 3 siblings, 1 reply; 93+ messages in thread From: kbuild test robot @ 2017-02-13 16:12 UTC (permalink / raw) To: Oleg Nesterov Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 902 bytes --] Hi Oleg, [auto build test ERROR on linus/master] [also build test ERROR on v4.10-rc8 next-20170213] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Oleg-Nesterov/fix-the-traced-mt-exec-deadlock/20170213-221943 config: m68k-sun3_defconfig (attached as .config) compiler: m68k-linux-gcc (GCC) 4.9.0 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=m68k All errors (new ones prefixed by >>): >> ERROR: "de_thread" [fs/binfmt_aout.ko] undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 11660 bytes --] ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 16:12 ` kbuild test robot @ 2017-02-13 16:47 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-13 16:47 UTC (permalink / raw) To: kbuild test robot Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 02/14, kbuild test robot wrote: > > Hi Oleg, > > [auto build test ERROR on linus/master] > [also build test ERROR on v4.10-rc8 next-20170213] > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] > > url: https://github.com/0day-ci/linux/commits/Oleg-Nesterov/fix-the-traced-mt-exec-deadlock/20170213-221943 > config: m68k-sun3_defconfig (attached as .config) > compiler: m68k-linux-gcc (GCC) 4.9.0 > reproduce: > wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross > chmod +x ~/bin/make.cross > # save the attached .config to linux build tree > make.cross ARCH=m68k > > All errors (new ones prefixed by >>): > > >> ERROR: "de_thread" [fs/binfmt_aout.ko] undefined! Aaah thanks... Of course, I forgot to add EXPORT_SYMBOL(de_thread), will send V2. Thanks a lot! Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov 2017-02-13 16:12 ` kbuild test robot @ 2017-02-13 16:39 ` kbuild test robot 2017-02-13 17:27 ` Mika Penttilä 2017-02-13 18:04 ` [PATCH V2 " Oleg Nesterov 3 siblings, 0 replies; 93+ messages in thread From: kbuild test robot @ 2017-02-13 16:39 UTC (permalink / raw) To: Oleg Nesterov Cc: kbuild-all, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 777 bytes --] Hi Oleg, [auto build test ERROR on linus/master] [also build test ERROR on v4.10-rc8 next-20170213] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Oleg-Nesterov/fix-the-traced-mt-exec-deadlock/20170213-221943 config: x86_64-randconfig-ne0-02132256 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): >> ERROR: "de_thread" [arch/x86/ia32/ia32_aout.ko] undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 22364 bytes --] ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov 2017-02-13 16:12 ` kbuild test robot 2017-02-13 16:39 ` kbuild test robot @ 2017-02-13 17:27 ` Mika Penttilä 2017-02-13 18:01 ` Oleg Nesterov 2017-02-13 18:04 ` [PATCH V2 " Oleg Nesterov 3 siblings, 1 reply; 93+ messages in thread From: Mika Penttilä @ 2017-02-13 17:27 UTC (permalink / raw) To: Oleg Nesterov, Andrew Morton Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 13.02.2017 16:15, Oleg Nesterov wrote: > + retval = de_thread(current); > + if (retval) > + return retval; > > if (N_MAGIC(ex) == OMAGIC) { > unsigned long text_addr, map_size; > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c > index 4223702..79508f7 100644 > --- a/fs/binfmt_elf.c > +++ b/fs/binfmt_elf.c > @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm) > setup_new_exec(bprm); > install_exec_creds(bprm); > > + retval = de_thread(current); > + if (retval) > + goto out_free_dentry; > + > /* Do this so that we can load the interpreter, if need be. 
We will > change some of these later */ > retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), > executable_stack); > if (retval < 0) > goto out_free_dentry; > - > + > current->mm->start_stack = bprm->p; > > /* Now we do a little grungy work by mmapping the ELF image into > diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c > index d2e36f8..75fd6d8 100644 > --- a/fs/binfmt_elf_fdpic.c > +++ b/fs/binfmt_elf_fdpic.c > @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm) > #endif > > install_exec_creds(bprm); > + retval = de_thread(current); > + if (retval) > + goto error; > + > if (create_elf_fdpic_tables(bprm, current->mm, > &exec_params, &interp_params) < 0) > goto error; > diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c > index 9b2917a..a0ad9a3 100644 > --- a/fs/binfmt_flat.c > +++ b/fs/binfmt_flat.c > @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm) > } > > install_exec_creds(bprm); > + res = de_thread(current); > + if (res) > + return res; > > set_binfmt(&flat_format); > > diff --git a/fs/exec.c b/fs/exec.c > index e579466..8591c56 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm) > return 0; > } > > +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig) > +{ > + for (;;) { > + if (unlikely(__fatal_signal_pending(tsk))) > + goto killed; > + set_current_state(TASK_KILLABLE); > + if (!sig->notify_count) > + break; > + schedule(); > + } > + __set_current_state(TASK_RUNNING); > + return 0; > + > +killed: > + /* protects against exit_notify() and __exit_signal() */ > + read_lock(&tasklist_lock); > + sig->group_exit_task = NULL; > + sig->notify_count = 0; > + read_unlock(&tasklist_lock); > + return -EINTR; > +} > + > +/* > + * Kill all the sub-threads and wait until they all pass exit_notify(). 
> + */ > +static int kill_sub_threads(struct task_struct *tsk) > +{ > + struct signal_struct *sig = tsk->signal; > + int err = -EINTR; > + > + if (thread_group_empty(tsk)) > + return 0; > + > + read_lock(&tasklist_lock); > + spin_lock_irq(&tsk->sighand->siglock); > + if (!signal_group_exit(sig)) { > + sig->group_exit_task = tsk; > + sig->notify_count = -zap_other_threads(tsk); > + err = 0; > + } > + spin_unlock_irq(&tsk->sighand->siglock); > + read_unlock(&tasklist_lock); > + > + if (!err) > + err = wait_for_notify_count(tsk, sig); > + return err; > + > +} > + > /* > - * This function makes sure the current process has its own signal table, > - * so that flush_signal_handlers can later reset the handlers without > - * disturbing other processes. (Other processes might share the signal > - * table via the CLONE_SIGHAND option to clone().) > + * This function makes sure the current process has no other threads and > + * has a private signal table so that flush_signal_handlers() can reset > + * the handlers without disturbing other processes which might share the > + * signal table via the CLONE_SIGHAND option to clone(). > */ > -static int de_thread(struct task_struct *tsk) > +int de_thread(struct task_struct *tsk) > { > struct signal_struct *sig = tsk->signal; > struct sighand_struct *oldsighand = tsk->sighand; > @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk) > if (thread_group_empty(tsk)) > goto no_thread_group; > > - /* > - * Kill all other threads in the thread group. > - */ > spin_lock_irq(lock); > - if (signal_group_exit(sig)) { > - /* > - * Another group action in progress, just > - * return so that the signal is processed. > - */ > - spin_unlock_irq(lock); > - return -EAGAIN; > - } > - > - sig->group_exit_task = tsk; > - sig->notify_count = zap_other_threads(tsk); > + sig->notify_count = sig->nr_threads; maybe nr_threads - 1 since nr_threads includes us ? 
+ sig->notify_count = sig->nr_threads - 1; > if (!thread_group_leader(tsk)) > sig->notify_count--; > - > - while (sig->notify_count) { > - __set_current_state(TASK_KILLABLE); > - spin_unlock_irq(lock); > - schedule(); > - if (unlikely(__fatal_signal_pending(tsk))) > - goto killed; > - spin_lock_irq(lock); > - } > spin_unlock_irq(lock); > > + if (wait_for_notify_count(tsk, sig)) > + return -EINTR; > + > /* > * At this point all other threads have exited, all we have to > - * do is to wait for the thread group leader to become inactive, > - * and to assume its PID: > + * do is to reap the old leader and assume its PID. > */ > if (!thread_group_leader(tsk)) { > struct task_struct *leader = tsk->group_leader; > > - for (;;) { > - threadgroup_change_begin(tsk); > - write_lock_irq(&tasklist_lock); > - /* > - * Do this under tasklist_lock to ensure that > - * exit_notify() can't miss ->group_exit_task > - */ > - sig->notify_count = -1; > - if (likely(leader->exit_state)) > - break; > - __set_current_state(TASK_KILLABLE); > - write_unlock_irq(&tasklist_lock); > - threadgroup_change_end(tsk); > - schedule(); > - if (unlikely(__fatal_signal_pending(tsk))) > - goto killed; > - } > - > + threadgroup_change_begin(tsk); > + write_lock_irq(&tasklist_lock); > /* > * The only record we have of the real-time age of a > * process, regardless of execs it's done, is start_time. 
> @@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk) > release_task(leader); > } > > +no_thread_group: > sig->group_exit_task = NULL; > sig->notify_count = 0; > - > -no_thread_group: > /* we have changed execution domain */ > tsk->exit_signal = SIGCHLD; > > @@ -1198,15 +1210,8 @@ static int de_thread(struct task_struct *tsk) > } > > BUG_ON(!thread_group_leader(tsk)); > + flush_signal_handlers(current, 0); > return 0; > - > -killed: > - /* protects against exit_notify() and __exit_signal() */ > - read_lock(&tasklist_lock); > - sig->group_exit_task = NULL; > - sig->notify_count = 0; > - read_unlock(&tasklist_lock); > - return -EAGAIN; > } > > char *get_task_comm(char *buf, struct task_struct *tsk) > @@ -1237,11 +1242,7 @@ int flush_old_exec(struct linux_binprm * bprm) > { > int retval; > > - /* > - * Make sure we have a private signal table and that > - * we are unassociated from the previous thread group. > - */ > - retval = de_thread(current); > + retval = kill_sub_threads(current); > if (retval) > goto out; > > @@ -1336,7 +1337,6 @@ void setup_new_exec(struct linux_binprm * bprm) > /* An exec changes our domain. 
We are no longer part of the thread > group */ > current->self_exec_id++; > - flush_signal_handlers(current, 0); > } > EXPORT_SYMBOL(setup_new_exec); > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h > index 1303b57..06a5a7b 100644 > --- a/include/linux/binfmts.h > +++ b/include/linux/binfmts.h > @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *); > extern int search_binary_handler(struct linux_binprm *); > extern int flush_old_exec(struct linux_binprm * bprm); > extern void setup_new_exec(struct linux_binprm * bprm); > +extern int de_thread(struct task_struct *tsk); > extern void would_dump(struct linux_binprm *, struct file *); > > extern int suid_dumpable; > diff --git a/kernel/exit.c b/kernel/exit.c > index 8f14b86..169d9f2 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > if (tsk->exit_state == EXIT_DEAD) > list_add(&tsk->ptrace_entry, &dead); > > - /* mt-exec, de_thread() is waiting for group leader */ > - if (unlikely(tsk->signal->notify_count < 0)) > + /* mt-exec, kill_sub_threads() is waiting for group exit */ > + if (unlikely(tsk->signal->notify_count < 0) && > + !++tsk->signal->notify_count) > wake_up_process(tsk->signal->group_exit_task); > write_unlock_irq(&tasklist_lock); > > diff --git a/kernel/signal.c b/kernel/signal.c > index 3603d93..b78ce63 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p) > > while_each_thread(p, t) { > task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); > - count++; > - > /* Don't bother with already dead threads */ > if (t->exit_state) > continue; > sigaddset(&t->pending.signal, SIGKILL); > signal_wake_up(t, 1); > + count++; > } > > return count; ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 17:27 ` Mika Penttilä @ 2017-02-13 18:01 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-13 18:01 UTC (permalink / raw) To: Mika Penttilä Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 02/13, Mika Penttilä wrote: > > > +int de_thread(struct task_struct *tsk) > > { > > struct signal_struct *sig = tsk->signal; > > struct sighand_struct *oldsighand = tsk->sighand; > > @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk) > > if (thread_group_empty(tsk)) > > goto no_thread_group; > > > > - /* > > - * Kill all other threads in the thread group. > > - */ > > spin_lock_irq(lock); > > - if (signal_group_exit(sig)) { > > - /* > > - * Another group action in progress, just > > - * return so that the signal is processed. > > - */ > > - spin_unlock_irq(lock); > > - return -EAGAIN; > > - } > > - > > - sig->group_exit_task = tsk; > > - sig->notify_count = zap_other_threads(tsk); > > + sig->notify_count = sig->nr_threads; > > > maybe nr_threads - 1 since nr_threads includes us ? Damn. Of course you are right, thanks a lot! Please see v2. Hmm. I didn't even notice my own test-case didn't pass because of this off-by-one. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
  2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
  ` (2 preceding siblings ...)
  2017-02-13 17:27 ` Mika Penttilä
@ 2017-02-13 18:04 ` Oleg Nesterov
  2017-02-16 11:42   ` Eric W. Biederman
  2017-02-17  4:42   ` Eric W. Biederman
  3 siblings, 2 replies; 93+ messages in thread

From: Oleg Nesterov @ 2017-02-13 18:04 UTC (permalink / raw)
To: Andrew Morton, Mika Penttilä
Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman,
	Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel

de_thread() waits for the other threads with ->cred_guard_mutex held, and
this is really bad because the wait time is not bounded: a debugger can
delay the exit, and this lock has a lot of users (mostly abusers, imo) in
fs/proc and elsewhere. This leads to a deadlock if the debugger tries to
take the same mutex:

	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <pthread.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

This hangs because de_thread() sleeps with cred_guard_mutex held, waiting
for the debugger to release the killed thread, while the debugger sleeps
waiting for the same mutex. It is not fatal (the tracer can be killed),
but it is still a bug, and people hit it in practice.

The patch changes flush_old_exec() to use the new kill_sub_threads()
helper instead of de_thread(): it only waits until all the other threads
have passed exit_notify(). However, we still need to wait until they all
disappear; the main reason is that we can not unshare sighand until then,
because the execing thread and the zombies must use the same
sighand->siglock to serialize access to ->thread_head etc. So the patch
changes the ->load_binary() methods to call de_thread() right after
install_exec_creds() drops cred_guard_mutex.

Note that de_thread() is simplified: it no longer needs to send SIGKILL
to the sub-threads and wait for the group leader to become a zombie. But
since it is now called after flush_old_exec(), we also need to move
flush_signal_handlers() into de_thread().

TODO:

	- Move install_exec_creds() and de_thread() into setup_new_exec(),
	  unexport de_thread().

	- Avoid tasklist_lock in kill_sub_threads(); we can change
	  exit_notify() to set ->exit_state under siglock.

	- Simplify/clean up the ->notify_count logic, it is really ugly. I
	  think we should just turn this counter into
	  signal->nr_live_threads; __exit_signal() can use
	  signal->nr_threads instead.

	- In any case, we should limit the scope of cred_guard_mutex in the
	  execve paths. It is not clear why we take it at the start of
	  execve, and worse, it is not clear why we actually overload this
	  mutex to exclude other threads (except for check_unsafe_exec(),
	  but that is solvable). The original motivation was
	  signal->in_exec_mm protection, but this idea was replaced by
	  3c77f8457221 ("exec: make argv/envp memory visible to
	  oom-killer"). It is just ugly to call copy_strings() etc with
	  this mutex held.
v2: - add EXPORT_SYMBOL(de_thread) - fix the usage of nr_threads in de_thread(), pointed by Mika Penttila <mika.penttila@nextfour.com> Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Reported-by: Attila Fazekas <afazekas@redhat.com> Reported-by: Aleksa Sarai <asarai@suse.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- arch/x86/ia32/ia32_aout.c | 3 ++ fs/binfmt_aout.c | 3 ++ fs/binfmt_elf.c | 6 ++- fs/binfmt_elf_fdpic.c | 4 ++ fs/binfmt_flat.c | 3 ++ fs/exec.c | 129 +++++++++++++++++++++++----------------------- include/linux/binfmts.h | 1 + kernel/exit.c | 5 +- kernel/signal.c | 3 +- 9 files changed, 88 insertions(+), 69 deletions(-) diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c index 7c0a711..a6b9cc9 100644 --- a/arch/x86/ia32/ia32_aout.c +++ b/arch/x86/ia32/ia32_aout.c @@ -312,6 +312,9 @@ static int load_aout_binary(struct linux_binprm *bprm) return retval; install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + return retval; if (N_MAGIC(ex) == OMAGIC) { unsigned long text_addr, map_size; diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c index 2a59139..edd1335 100644 --- a/fs/binfmt_aout.c +++ b/fs/binfmt_aout.c @@ -256,6 +256,9 @@ static int load_aout_binary(struct linux_binprm * bprm) return retval; install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + return retval; if (N_MAGIC(ex) == OMAGIC) { unsigned long text_addr, map_size; diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 4223702..79508f7 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm) setup_new_exec(bprm); install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto out_free_dentry; + /* Do this so that we can load the interpreter, if need be. 
We will change some of these later */ retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), executable_stack); if (retval < 0) goto out_free_dentry; - + current->mm->start_stack = bprm->p; /* Now we do a little grungy work by mmapping the ELF image into diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index d2e36f8..75fd6d8 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm) #endif install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto error; + if (create_elf_fdpic_tables(bprm, current->mm, &exec_params, &interp_params) < 0) goto error; diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c index 9b2917a..a0ad9a3 100644 --- a/fs/binfmt_flat.c +++ b/fs/binfmt_flat.c @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm) } install_exec_creds(bprm); + res = de_thread(current); + if (res) + return res; set_binfmt(&flat_format); diff --git a/fs/exec.c b/fs/exec.c index e579466..8e8f2ef 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm) return 0; } +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig) +{ + for (;;) { + if (unlikely(__fatal_signal_pending(tsk))) + goto killed; + set_current_state(TASK_KILLABLE); + if (!sig->notify_count) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; + +killed: + /* protects against exit_notify() and __exit_signal() */ + read_lock(&tasklist_lock); + sig->group_exit_task = NULL; + sig->notify_count = 0; + read_unlock(&tasklist_lock); + return -EINTR; +} + +/* + * Kill all the sub-threads and wait until they all pass exit_notify(). 
+ */ +static int kill_sub_threads(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + int err = -EINTR; + + if (thread_group_empty(tsk)) + return 0; + + read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); + if (!signal_group_exit(sig)) { + sig->group_exit_task = tsk; + sig->notify_count = -zap_other_threads(tsk); + err = 0; + } + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); + + if (!err) + err = wait_for_notify_count(tsk, sig); + return err; + +} + /* - * This function makes sure the current process has its own signal table, - * so that flush_signal_handlers can later reset the handlers without - * disturbing other processes. (Other processes might share the signal - * table via the CLONE_SIGHAND option to clone().) + * This function makes sure the current process has no other threads and + * has a private signal table so that flush_signal_handlers() can reset + * the handlers without disturbing other processes which might share the + * signal table via the CLONE_SIGHAND option to clone(). */ -static int de_thread(struct task_struct *tsk) +int de_thread(struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk) if (thread_group_empty(tsk)) goto no_thread_group; - /* - * Kill all other threads in the thread group. - */ spin_lock_irq(lock); - if (signal_group_exit(sig)) { - /* - * Another group action in progress, just - * return so that the signal is processed. 
- */ - spin_unlock_irq(lock); - return -EAGAIN; - } - - sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); + sig->notify_count = sig->nr_threads - 1; if (!thread_group_leader(tsk)) sig->notify_count--; - - while (sig->notify_count) { - __set_current_state(TASK_KILLABLE); - spin_unlock_irq(lock); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - spin_lock_irq(lock); - } spin_unlock_irq(lock); + if (wait_for_notify_count(tsk, sig)) + return -EINTR; + /* * At this point all other threads have exited, all we have to - * do is to wait for the thread group leader to become inactive, - * and to assume its PID: + * do is to reap the old leader and assume its PID. */ if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } - + threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a * process, regardless of execs it's done, is start_time. 
@@ -1162,10 +1175,9 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } +no_thread_group: sig->group_exit_task = NULL; sig->notify_count = 0; - -no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD; @@ -1198,16 +1210,10 @@ static int de_thread(struct task_struct *tsk) } BUG_ON(!thread_group_leader(tsk)); + flush_signal_handlers(current, 0); return 0; - -killed: - /* protects against exit_notify() and __exit_signal() */ - read_lock(&tasklist_lock); - sig->group_exit_task = NULL; - sig->notify_count = 0; - read_unlock(&tasklist_lock); - return -EAGAIN; } +EXPORT_SYMBOL(de_thread); char *get_task_comm(char *buf, struct task_struct *tsk) { @@ -1237,11 +1243,7 @@ int flush_old_exec(struct linux_binprm * bprm) { int retval; - /* - * Make sure we have a private signal table and that - * we are unassociated from the previous thread group. - */ - retval = de_thread(current); + retval = kill_sub_threads(current); if (retval) goto out; @@ -1336,7 +1338,6 @@ void setup_new_exec(struct linux_binprm * bprm) /* An exec changes our domain. 
We are no longer part of the thread group */ current->self_exec_id++; - flush_signal_handlers(current, 0); } EXPORT_SYMBOL(setup_new_exec); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 1303b57..06a5a7b 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *); extern int search_binary_handler(struct linux_binprm *); extern int flush_old_exec(struct linux_binprm * bprm); extern void setup_new_exec(struct linux_binprm * bprm); +extern int de_thread(struct task_struct *tsk); extern void would_dump(struct linux_binprm *, struct file *); extern int suid_dumpable; diff --git a/kernel/exit.c b/kernel/exit.c index 8f14b86..169d9f2 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -699,8 +699,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) + /* mt-exec, kill_sub_threads() is waiting for group exit */ + if (unlikely(tsk->signal->notify_count < 0) && + !++tsk->signal->notify_count) wake_up_process(tsk->signal->group_exit_task); write_unlock_irq(&tasklist_lock); diff --git a/kernel/signal.c b/kernel/signal.c index 3603d93..b78ce63 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1200,13 +1200,12 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; - /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count; -- 2.5.0 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 18:04 ` [PATCH V2 " Oleg Nesterov @ 2017-02-16 11:42 ` Eric W. Biederman 2017-02-20 15:22 ` Oleg Nesterov 2017-02-17 4:42 ` Eric W. Biederman 1 sibling, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-02-16 11:42 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel I am slowly working my way through this and I have found something that I have an issue with. Oleg Nesterov <oleg@redhat.com> writes: > - In any case we should limit the scope of cred_guard_mutex in execve paths. > It is not clear why we take it at the start of execve, and worse, it is > not clear why we actually overload this mutex to exclude other threads > (except check_unsafe_exec() but this is solvable). The original motivation > was signal->in_exec_mm protection but this idea was replaced by 3c77f8457221 > ("exec: make argv/envp memory visible to oom-killer"). It is just ugly to > call copy_strings/etc with this mutex held. The original changes that introduced cred_guard_mutex are: a6f76f23d297 ("CRED: Make execve() take advantage of copy-on-write credentials") d84f4f992cbd ("CRED: Inaugurate COW credentials") So I don't think you actually have your history right. Beyond that there is a compelling reason to have exec appear atomic from the perspective of ptrace_attach. If the operation is a setuid exec and the tracer does not have permission to trace the original or the result of the exec there could be some significant information leakage if the exec operation is not atomic from the perspective of ptrace_attach. This is the kind of thing that could easily defeat address space layout randomization of setuid executables if we get this wrong. So I don't think this is a small thing we are talking about. 
And while I would like to merge this patch, a patch description that appears to encourage people to make changes that will create security vulnerabilities does not make me comfortable. Additionally your comment makes me nervous when you are wondering why we take this mutex to exclude other threads and I look in the git history and see: commit 9b1bf12d5d51bca178dea21b04a0805e29d60cf1 Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Date: Wed Oct 27 15:34:08 2010 -0700 signals: move cred_guard_mutex from task_struct to signal_struct Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve() has no worth. Let's move ->cred_guard_mutex from task_struct to signal_struct. It naturally prevent multiple-threads-inside-exec. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> AKA it was your idea and you reviewed the patch that made it happen. So while I fully agree we have issues here that we need to address and fix your patch description does not inspire confidence. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-16 11:42 ` Eric W. Biederman @ 2017-02-20 15:22 ` Oleg Nesterov 2017-02-20 15:36 ` Oleg Nesterov 2017-02-20 22:30 ` Eric W. Biederman 0 siblings, 2 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-20 15:22 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel Eric, Thanks for looking into this! and sorry for the delay. On 02/17, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > > - In any case we should limit the scope of cred_guard_mutex in execve paths. > > It is not clear why we take it at the start of execve, and worse, it is > > not clear why we actually overload this mutex to exclude other threads > > (except check_unsafe_exec() but this is solvable). The original motivation > > was signal->in_exec_mm protection but this idea was replaced by 3c77f8457221 > > ("exec: make argv/envp memory visible to oom-killer"). It is just ugly to > > call copy_strings/etc with this mutex held. > > > The original changes that introduced cred_guard_mutex are: > a6f76f23d297 ("CRED: Make execve() take advantage of copy-on-write credentials") > d84f4f992cbd ("CRED: Inaugurate COW credentials") > > So I don't think you actually have your history right. > > Beyond that there is a compelling reason to have exec appear atomic from > the perspective of ptrace_attach. If the operation is a setuid exec > and the tracer does not have permission to trace the original or the > result of the exec there could be some significant information leakage > if the exec operation is not atomic from the perspective of > ptrace_attach. Yes sure. But I meant execve() should not take cred_guard_mutex at the start; it should take it later even if we do not rework the security hooks. 
At least it should take it after copy_strings(), but probably this needs some work. > Additionally your comment makes me nervous when you are wondering why we > take this mutex to exclude other threads and I look in the git history > and see: > > commit 9b1bf12d5d51bca178dea21b04a0805e29d60cf1 > Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Date: Wed Oct 27 15:34:08 2010 -0700 > > signals: move cred_guard_mutex from task_struct to signal_struct > > Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec > itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent > execve() has no worth. > > Let's move ->cred_guard_mutex from task_struct to signal_struct. It > naturally prevent multiple-threads-inside-exec. Yes, and let me explain the original motivation for this change. To remind, we had a problem with copy_strings() which can use a lot of memory, and this memory was not visible to OOM-killer. So we were going to add the new member, signal_struct->in_exec_mm = bprm->mm and change OOM-killer to account both task->mm and task->signal->in_exec_mm. And in this case we obviously need to ensure that only one thread can enter exec and use signal_struct->in_exec_mm. That patch was ready, but then we found another (better) solution: 3c77f8457221 ("exec: make argv/envp memory visible to oom-killer"). So I do not think we need to exclude other threads today, and we do not need to hold cred_guard_mutex throughout the whole execve path. Again, this needs some work. For example check_unsafe_exec() assumes it can't race with another thread, see 9e00cdb091b008cb3c78192651180 "exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic". But this looks solvable. > So while I fully agree we have issues here that we need to address and > fix your patch description does not inspire confidence. See above... what do you think I should change in this part of changelog? Thanks, Oleg. 
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-20 15:22 ` Oleg Nesterov @ 2017-02-20 15:36 ` Oleg Nesterov 2017-02-20 22:30 ` Eric W. Biederman 1 sibling, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-20 15:36 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 02/20, Oleg Nesterov wrote: > > Again, this needs some work. For example check_unsafe_exec() assumes > it can't race with another thread, see 9e00cdb091b008cb3c78192651180 > "exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic". > But this looks solvable. Forgot to mention... plus check_unsafe_exec() checks ptrace; this is another reason why we can't simply shift mutex_lock(cred_guard_mutex) later. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-20 15:22 ` Oleg Nesterov 2017-02-20 15:36 ` Oleg Nesterov @ 2017-02-20 22:30 ` Eric W. Biederman 2017-02-21 17:53 ` Oleg Nesterov 1 sibling, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-02-20 22:30 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel Oleg, My apologies for not replying in-line but I think I can be clearer just saying what I think needs to be said. Today cred_guard_mutex is part of making exec appear to be an atomic operation to ptrace and proc. To make exec appear to be atomic we do need to take the mutex at the beginning and release it at the end of exec. The semantics of exec appearing atomic to ptrace_attach and to proc readers are necessary to ensure we use the proper process credentials in the event of a suid exec. I believe making cred_guard_mutex per task is an option. Reducing the scope of cred_guard_mutex concerns me. There appear to be some fields like sighand that we currently expose in proc that might possibly be problematic. So I am not certain we protect enough things in our proc accessors. Do you know if we can make cred_guard_mutex a per-task lock again? Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-20 22:30 ` Eric W. Biederman @ 2017-02-21 17:53 ` Oleg Nesterov 2017-02-21 20:20 ` Eric W. Biederman 0 siblings, 1 reply; 93+ messages in thread From: Oleg Nesterov @ 2017-02-21 17:53 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 02/21, Eric W. Biederman wrote: > > Today cred_guard_mutex is part of making exec appear to be an atomic > operation to ptrace and proc. To make exec appear to be atomic > we do need to take the mutex at the beginning and release it at the end > of exec. > > The semantics of exec appearing atomic to ptrace_attach and to proc readers > are necessary to ensure we use the proper process credentials in the > event of a suid exec. This is clear. My point is that imo a) it is over-used in fs/proc and b) the scope of this mutex in execve is too huge. I see absolutely no reason to do copy_strings() with this mutex held, for example. And note that copy_strings() can use a lot of memory/time, it can trigger oom, swapping, etc. But let me repeat, this is a bit off-topic right now, this patch doesn't change anything in this respect, afaics. > I believe making cred_guard_mutex per task is an option. Reducing the > scope of cred_guard_mutex concerns me. There appear to be some fields > like sighand that we currently expose in proc please see another email, collect_sigign_sigcatch() is called without this mutex. > Do you know if we can make cred_guard_mutex a per-task lock again? I think we can, but this needs some (afaics simple) changes too. But for what? Note that the problem fixed by this series won't go away if we do this. So what do you think about this series? Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-21 17:53 ` Oleg Nesterov @ 2017-02-21 20:20 ` Eric W. Biederman 2017-02-22 17:41 ` Oleg Nesterov 0 siblings, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-02-21 20:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel Oleg Nesterov <oleg@redhat.com> writes: > On 02/21, Eric W. Biederman wrote: >> >> Today cred_guard_mutex is part of making exec appear to be an atomic >> operation to ptrace and proc. To make exec appear to be atomic >> we do need to take the mutex at the beginning and release it at the end >> of exec. >> >> The semantics of exec appearing atomic to ptrace_attach and to proc readers >> are necessary to ensure we use the proper process credentials in the >> event of a suid exec. > > This is clear. My point is that imo a) it is over-used in fs/proc and b) > the scope of this mutex in execve is too huge. I see absolutely no reason > to do copy_strings() with this mutex held, for example. And note that > copy_strings() can use a lot of memory/time, it can trigger oom, swapping, > etc. I agree that we can do things like copy_strings that don't change the data structures of the task before taking the cred_guard_mutex. > But let me repeat, this is a bit off-topic right now, this patch doesn't > change anything in this respect, afaics. > > >> I believe making cred_guard_mutex per task is an option. Reducing the >> scope of cred_guard_mutex concerns me. There appear to be some fields >> like sighand that we currently expose in proc > > please see another email, collect_sigign_sigcatch() is called without this > mutex. I agree that it is called without the mutex. It is not clear to me that is the correct behavior. 
It violates the fundamental property that exec of a setuid executable should be an atomic operation. I don't know how much we care but it disturbs me that we can read something of a process's signal handling state with the wrong credentials. Adopting an implementation where we can never fix this apparent bug really really disturbs me. >> Do you know if we can make cred_guard_mutex a per-task lock again? > > I think we can, but this needs some (afaics simple) changes too. > > But for what? Note that the problem fixed by this series won't go away > if we do this. I believe it will if the other waiters use mutex_lock_killable. > So what do you think about this series? I like the second patch. That seems clean and reasonable. I really don't like the first patch. It makes an information leak a required detail of the implementation and as such possibly something we can never change. It attempts to paint a picture for a full fix in the future that appears to result in an incorrect kernel. That really bugs me. I suspect that a good fix that respects that proc and ptrace_attach need to exclude the setuid exec case for semantic reasons would have a similar complexity. I think a mutex doing the job that cred_guard_mutex is doing, especially when we have multiple readers and a single writer, is the wrong locking primitive. A reader-writer lock or something even cheaper would probably be much better. I think fixing the deadlock is important. I think structuring the fix in such a way that the code is easily maintainable in the future is also very important. Right now it feels like your fix in patch 1 makes things a bit more brittle and I don't like that at all. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-21 20:20 ` Eric W. Biederman @ 2017-02-22 17:41 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-22 17:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 02/22, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > >> Reducing the > >> scope of cred_guard_mutex concerns me. There appear to be some fields > >> like sighand that we currently expose in proc > > > > please see another email, collect_sigign_sigcatch() is called without this > > mutex. > > I agree that it is called without the mutex. It is not clear to me that > is the correct behavior. I fail to understand how/why this can be wrong. > >> Do you know if we can make cred_guard_mutex a per-task lock again? > > > > I think we can, but this needs some (afaics simple) changes too. > > > > But for what? Note that the problem fixed by this series won't go away > > if we do this. > > I believe it will if the other waiters use mutex_lock_killable. No. They already use mutex_lock_killable/interruptible. And the test-case can be killed, it is not a hard lockup. > I really don't like the first patch. Just in case, I don't really like it either. Simply because it makes execve more complex: we need to wait for sub-threads twice. > It makes an information leak > a required detail of the implementation and as such possibly something > we can never change. Again, I simply can't understand how flush_signal_handlers() outside of cred_guard_mutex can be treated as an information leak. Even _if_ collect_sigign_sigcatch() was called with this mutex held. Or do you mean something else? 
> I suspect that a good fix that respects that proc and ptrace_attach need > to exclude the setuid exec case for semantic reasons would have a similar > complexity. I am not sure I understand how we can do this. We need cred_guard_mutex or something else even if exec is not setuid and does not change the credentials; an LSM module can nack exec-under-ptrace for any reason. > I think fixing the deadlock is important. Yes. People actually hit this bug; it was reported several times. > Right now it feels like your fix in patch 1 makes things a bit more > brittle and I don't like that at all. See above, I am not proud of this change either. I even mentioned in 0/2 that it would be nice to reconsider this change in the long term. But I do not see another simple and _backportable_ solution for now. What do you think we can do instead for stable trees? Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-13 18:04 ` [PATCH V2 " Oleg Nesterov 2017-02-16 11:42 ` Eric W. Biederman @ 2017-02-17 4:42 ` Eric W. Biederman 2017-02-20 15:50 ` Oleg Nesterov 1 sibling, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-02-17 4:42 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, David Howells Oleg Nesterov <oleg@redhat.com> writes: > de_thread() waits for other threads with ->cred_guard_mutex held and this > is really bad because the time is not bounded, debugger can delay the exit > and this lock has a lot of users (mostly abusers imo) in fs/proc and more. > And this leads to deadlock if debugger tries to take the same mutex: Oleg. I looked at the history in proc of users of cred_guard_mutex and the proc users are grabbing cred_guard_mutex for the proper semantic reasons. To avoid races with setuid exec that could result in an information disclosure. I do agree that a mutex is the wrong data structure for the job cred_guard_mutex is performing. The job of ensuring debuggers and proc readers see either the old version or the new version of the task. I need to play with the code but I suspect the best way to handle this, preventing both security issues and problems in the future, is to create a new task struct and populate it appropriately with the new data from exec (at least in the case of setuid exec). I am thinking of generalizing the case of a non-leader exec where we have to assume the leader's pid. I don't yet know what the performance implications would be but that would clean the users up a lot. On that score I believe we can incrementally approach that point and only grab the cred_guard_mutex in exec if we are performing an exec that changes the process's credentials. 
Right now I don't think it introduces any new security information disclosures but the moving of flush_signal_handlers outside of cred_guard_mutex feels wrong. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held 2017-02-17 4:42 ` Eric W. Biederman @ 2017-02-20 15:50 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-20 15:50 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Mika Penttilä, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, David Howells On 02/17, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > de_thread() waits for other threads with ->cred_guard_mutex held and this > > is really bad because the time is not bounded, debugger can delay the exit > > and this lock has a lot of users (mostly abusers imo) in fs/proc and more. > > And this leads to deadlock if debugger tries to take the same mutex: > > Oleg. I looked at the history in proc of users of cred_guard_mutex > and the proc users are grabbing cred_guard_mutex for the proper > semantic reasons. To avoid races with setuid exec that could result in > an information disclosure. This is clear. However I really think it is over-used. For example, I do think that lock_trace() should die. OK, of course I can be wrong. And in any case this is almost off-topic right now, let's discuss this separately. > On that score I believe we can incrementally approach that point and > only grab the cred_guard_mutex in exec if we are performing an exec that > changes the process's credentials. Maybe... but afaics this is more complicated because of LSM hooks. To me one of the main problems is that an LSM hook can fail if the task is traced; that is why we can't simply take cred_guard_mutex and check ->ptrace after de_thread(). But again, let's discuss this later. > Right now I don't think it introduces any new security information > disclosures but the moving of flush_signal_handlers outside of > cred_guard_mutex feels wrong. Why? 
Just in case, we can do flush_signal_handlers() only after we unshare ->sighand; that is why it was moved outside of cred_guard_mutex. But why is this bad? collect_sigign_sigcatch() is called without this mutex; who else can look at sa.sa_handler? Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec 2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov 2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov @ 2017-02-13 14:15 ` Oleg Nesterov 2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov 2 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-02-13 14:15 UTC (permalink / raw) To: Andrew Morton Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Eric W. Biederman, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel The previous patch fixed the deadlock when mt-exec waits for debugger which should reap a zombie thread, but we can hit the same problem if the killed sub-thread stops in ptrace_event(PTRACE_EVENT_EXIT). Change may_ptrace_stop() to check signal->group_exit_task. This is a user-visible change. But hopefully it can't break anything. Note that the semantics of PTRACE_EVENT_EXIT was never really defined, it depends on /dev/random. Just for example, currently a sub-thread killed by exec will stop, but if it exits on its own and races with exec it will not stop, so nobody can rely on PTRACE_EVENT_EXIT anyway. We really need to finally define what PTRACE_EVENT_EXIT should actually do, but this needs other changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/signal.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index b78ce63..16e44d7 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1738,6 +1738,20 @@ static inline int may_ptrace_stop(void) { if (!likely(current->ptrace)) return 0; + + /* + * Other checks can only affect PTRACE_EVENT_EXIT, a task with the + * pending SIGKILL can't block in TASK_TRACED. But PTRACE_EVENT_EXIT + * can be reported after SIGKILL was already dequeued. + */ + + /* + * Coredump or exec. 
We must not stop in the latter case, debugger can + * wait for cred_guard_mutex held by execing thread. + */ + if (current->signal->group_exit_task) + return 0; + /* * Are we in the middle of do_coredump? * If so and our tracer is also part of the coredump stopping @@ -1746,10 +1760,6 @@ static inline int may_ptrace_stop(void) * If SIGKILL was already sent before the caller unlocked * ->siglock we must see ->core_state != NULL. Otherwise it * is safe to enter schedule(). - * - * This is almost outdated, a task with the pending SIGKILL can't - * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported - * after SIGKILL was already dequeued. */ if (unlikely(current->mm->core_state) && unlikely(current->mm == current->parent->mm)) -- 2.5.0 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock 2017-02-13 14:14 [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov 2017-02-13 14:15 ` [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov 2017-02-13 14:15 ` [PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov @ 2017-02-24 16:03 ` Oleg Nesterov 2017-03-03 1:05 ` Eric W. Biederman 2 siblings, 1 reply; 93+ messages in thread From: Oleg Nesterov @ 2017-02-24 16:03 UTC (permalink / raw) To: Andrew Morton, Eric W. Biederman Cc: Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel Eric, our discussion was a bit confusing, and it seems that we did not fully convince each other. So let me ask: what do you finally think about this fix? Let me repeat. Even if I do not agree with some of your objections, I do agree that 1/2 does not look nice and clean. And we seem to agree that either way, with or without this fix, we need more changes in this area. But we need a simple and backportable fix for stable trees, say for rhel7. This bug was reported many times, and this is the simplest solution I was able to find. Oleg. On 02/13, Oleg Nesterov wrote: > > Hello, > > Lets finally fix this problem, it was reported several times. I still think that > in the longer term we should (try to) rework the security hooks and (partially) > revert this change, but this is not trivial and we need something backportable > anyway. > > Eric, Jann, we already discussed this change. 1/2 is the same patch I suggested 3 > months ago except now it compiles and moves flush_signal_handlers() to de_thread(). > > Both patches ask for subsequent cleanups, see the changelogs. > > Oleg. 
> > arch/x86/ia32/ia32_aout.c | 3 ++ > fs/binfmt_aout.c | 3 ++ > fs/binfmt_elf.c | 6 ++- > fs/binfmt_elf_fdpic.c | 4 ++ > fs/binfmt_flat.c | 3 ++ > fs/exec.c | 128 +++++++++++++++++++++++----------------------- > include/linux/binfmts.h | 1 + > kernel/exit.c | 5 +- > kernel/signal.c | 21 +++++--- > 9 files changed, 101 insertions(+), 73 deletions(-) ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock 2017-02-24 16:03 ` [PATCH 0/2] fix the traced mt-exec deadlock Oleg Nesterov @ 2017-03-03 1:05 ` Eric W. Biederman 2017-03-03 17:33 ` Oleg Nesterov 0 siblings, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-03-03 1:05 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel Oleg Nesterov <oleg@redhat.com> writes: > Eric, > > our discussion was a bit confusing, and it seems that we did not > fully convince each other. So let me ask what do you finally think > about this fix. > > Let me repeat. Even if I do not agree with some of your objections, > I do agree that 1/2 does not look nice and clean. And we seem to > agree that either way, with or without this fix, we need more changes > in this area. > > But we need a simple and backportable fix for stable trees, say for > rhel7. This bug was reported many times, and this is the simplest > solution I was able to find. I am not 100% convinced that we need a backportable solution, but regardless my experience is that good clean solutions are almost always easier to backport than something we just settle for. The patch below needs a little more looking and testing but arguably it is sufficient. It implements autoreaping for non-leader threads always. We might want to limit this to the case of exec. For exec there is a strong argument that no one cares, as the exit_code/status returned by these exiting threads has always been 0. Maybe that is ok (it reads as exit success), but it certainly seems wrong in the case of exec; it would really be better to have an exit status that says you went away via exec. I know it has always been 0 because fs/exec.c does not set sig->group_exit_code, so the value initialized in copy_signal from kmem_cache_zalloc is used, AKA 0. This change has very little to do with the cred_guard_mutex. 
Just a question of which semantics we want to export. In my preliminary testing this seems viable. My big problem with your previous patch is that it is in no way obviously correct. Nor is it well explained. All signs that someone is pushing too hard, and that something better can be achieved if someone steps back a little. When the following patch can be made to solve the same problem, we certainly need an explanation in the commit comment on why we are using much more code for the same job. diff --git a/kernel/exit.c b/kernel/exit.c index 5cfbd595f918..5a432e0dcf4a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) thread_group_empty(tsk) && !ptrace_reparented(tsk) ? tsk->exit_signal : SIGCHLD; - autoreap = do_notify_parent(tsk, sig); + do_notify_parent(tsk, sig); + /* Autoreap threads even when ptraced */ + autoreap = !thread_group_leader(tsk); } else if (thread_group_leader(tsk)) { autoreap = thread_group_empty(tsk) && do_notify_parent(tsk, tsk->exit_signal); @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead) } tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; - if (tsk->exit_state == EXIT_DEAD) - list_add(&tsk->ptrace_entry, &dead); /* mt-exec, de_thread() is waiting for group leader */ if (unlikely(tsk->signal->notify_count < 0)) @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead) list_del_init(&p->ptrace_entry); release_task(p); } + if (autoreap) + release_task(tsk); } #ifdef CONFIG_DEBUG_STACK_USAGE Eric ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock 2017-03-03 1:05 ` Eric W. Biederman @ 2017-03-03 17:33 ` Oleg Nesterov 2017-03-03 18:23 ` Eric W. Biederman 0 siblings, 1 reply; 93+ messages in thread From: Oleg Nesterov @ 2017-03-03 17:33 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel On 03/02, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > our discussion was a bit confusing, and it seems that we did not > > fully convince each other. So let me ask what do you finally think > > about this fix. > > > > Let me repeat. Even if I do not agree with some of your objections, > > I do agree that 1/2 does not look nice and clean. And we seem to > > agree that either way, with or without this fix, we need more changes > > in this area. > > > > But we need a simple and backportable fix for stable trees, say for > > rhel7. This bug was reported many times, and this is the simplest > > solution I was able to find. > > I am not 100% convinced that we need a backportable solution, I think we do need one; this was already requested. > but > regardless my experience is that good clean solutions are almost always > easier to backport that something we just settle for. Sure. > The patch below needs a little more looking and testing but arguably > it is sufficient. > > It implements autoreaping for non-leader threads always. We might want > to limit this to the case of exec. I should have mentioned this. Of course, this change was proposed from the very beginning, when this problem was first reported. And of course I like this fix much more than my patch (but yes, we should limit it to the case of exec). The only problem is that it is an incompatible change, and I have no idea what can be broken. 
Trivial test-case: void *thread(void *arg) { for (;;) pause(); return NULL; } int main(void) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); execlp("true", "true", NULL); } with your patch applied $ strace -f ./test strace: wait4(__WALL): No child processes Yes, this is just a warning, but still. Now we need to change strace. And we can't know what else can be confused/broken by this change. man(ptrace) even documents that all other threads except the thread group leader report death as if they exited via _exit(2). Yes, yes, it also says that other threads stop in PTRACE_EVENT_EXIT stop, so 2/2 (which we need with your change too) is not compatible either and I am worried, but: - this was never really true, an already exiting thread won't stop if it races with exec - PTRACE_O_TRACEEXEC is not used that often, it never really worked - man(ptrace) also mentions that PTRACE_EVENT_EXIT behaviour may be changed in the future. In short. Of course I considered this change. Looks too risky to me. But. I will be happy if you send this change and take all the blame ;) I won't argue (if you limit it to the execve case). > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > thread_group_empty(tsk) && > !ptrace_reparented(tsk) ? > tsk->exit_signal : SIGCHLD; > - autoreap = do_notify_parent(tsk, sig); > + do_notify_parent(tsk, sig); > + /* Autoreap threads even when ptraced */ > + autoreap = !thread_group_leader(tsk); > } else if (thread_group_leader(tsk)) { > autoreap = thread_group_empty(tsk) && > do_notify_parent(tsk, tsk->exit_signal); This is all you need, > @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > } > > tsk->exit_state = autoreap ? 
EXIT_DEAD : EXIT_ZOMBIE; > - if (tsk->exit_state == EXIT_DEAD) > - list_add(&tsk->ptrace_entry, &dead); > > /* mt-exec, de_thread() is waiting for group leader */ > if (unlikely(tsk->signal->notify_count < 0)) > @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > list_del_init(&p->ptrace_entry); > release_task(p); > } > + if (autoreap) > + release_task(tsk); These 2 changes are not needed. release_task(tsk) will be called by list_for_each_entry_safe() above if autoreap == T. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock @ 2017-03-03 18:23 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-03-03 18:23 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Cc'd linux-api as we are talking about a deliberate user visible API change here. Oleg Nesterov <oleg@redhat.com> writes: > On 03/02, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > our discussion was a bit confusing, and it seems that we did not >> > fully convince each other. So let me ask what do you finally think >> > about this fix. >> > >> > Let me repeat. Even if I do not agree with some of your objections, >> > I do agree that 1/2 does not look nice and clean. And we seem to >> > agree that either way, with or without this fix, we need more changes >> > in this area. >> > >> > But we need a simple and backportable fix for stable trees, say for >> > rhel7. This bug was reported many times, and this is the simplest >> > solution I was able to find. >> >> I am not 100% convinced that we need a backportable solution, > > I think we need, this was already requested, > >> but >> regardless my experience is that good clean solutions are almost always >> easier to backport that something we just settle for. > > Sure. > >> The patch below needs a little more looking and testing but arguably >> it is sufficient. >> >> It implements autoreaping for non-leader threads always. We might want >> to limit this to the case of exec. > > I should have mentioned this. Of course, this change was proposed from the > very beginning, when this problem was reported first time. And of course > I like this fix much more than my patch (but yes, we shouldd limit it to > the case of exec). > > The only problem is that it is incompatible change, and I have no idea what > can be broken. 
> > Trivial test-case: > > void *thread(void *arg) > { > for (;;) > pause(); > return NULL; > } > > int main(void) > { > pthread_t pt; > pthread_create(&pt, NULL, thread, NULL); > execlp("true", "true", NULL); > } > > with your patch applied > > $ strace -f ./test > strace: wait4(__WALL): No child processes > > Yes, this is just a warning, but still. Now we need to change strace. And we > can't know what else can be confused/broken by this change. > > man(ptrace) even documents that all other threads except the thread group leader > report death as if they exited via _exit(2). > > Yes, yes, it also says that other threads stop in PTRACE_EVENT_EXIT stop, > so 2/2 (which we need with your change too) is is not compatible too and > I am worried, but: > > - this was never really true, an already exiting thread won't stop > if it races with exec > > - PTRACE_O_TRACEEXEC is not used that often, it never really worked > > - man(ptrace) also mentions that PTRACE_EVENT_EXIT behaviour may be > change in the future. > > In short. Of course I considered this change. Looks too risky to me. But. > I will be happy if you send this change and take all the blame ;) I won't > argue (if you limit it to execve case). Unfortunately you have found what already looks like a userspace regression. So I don't think we can do that. I do think the user space visible change of modifying a multi-threaded exec not to wait for the other threads to be reaped is the least dangerous change. The big lesson for me, and what was not obvious from your change description, is that we are changing the user space visible semantics of exec+ptrace and that cred_guard_mutex is not at all the problem (as we always take cred_guard_mutex in a killable or interruptible way). So I tentatively agree that we need to deliberately change the semantics of exec to not wait for zombies in fs/exec.c:de_thread. 
As part of that change I expect we want to call __cleanup_sighand from do_exit rather than release_task. Semantically we sould not need the sighand_struct for zombie process. >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -690,7 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) >> thread_group_empty(tsk) && >> !ptrace_reparented(tsk) ? >> tsk->exit_signal : SIGCHLD; >> - autoreap = do_notify_parent(tsk, sig); >> + do_notify_parent(tsk, sig); >> + /* Autoreap threads even when ptraced */ >> + autoreap = !thread_group_leader(tsk); >> } else if (thread_group_leader(tsk)) { >> autoreap = thread_group_empty(tsk) && >> do_notify_parent(tsk, tsk->exit_signal); > > This is all you need, > >> @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead) >> } >> >> tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; >> - if (tsk->exit_state == EXIT_DEAD) >> - list_add(&tsk->ptrace_entry, &dead); >> >> /* mt-exec, de_thread() is waiting for group leader */ >> if (unlikely(tsk->signal->notify_count < 0)) >> @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead) >> list_del_init(&p->ptrace_entry); >> release_task(p); >> } >> + if (autoreap) >> + release_task(tsk); > > These 2 changes are not needed. release_task(tsk) will be called by > list_for_each_entry_safe() above if autoreap == T. Except for the practical case that for threads that are ptraced tsk->ptrace_entry is already in use. Which means we can't use list_add(&tsk->ptrace_entry, &dead). Or in short we get a really pretty oops because we corrupt one of the ptrace lists if we don't have my extra two hunks. But I agree logically they are separate changes. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock 2017-03-03 18:23 ` Eric W. Biederman @ 2017-03-03 18:59 ` Eric W. Biederman -1 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-03-03 18:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api ebiederm@xmission.com (Eric W. Biederman) writes: > The big lesson for me, and what was not obvious from your change > description is that we are changing the user space visible semantics > of exec+ptrace and that cred_guard_mutex is not at all the problem (as > we always take cred_guard_mutex in a killable or interruptible way). Just to follow up. Because the cred_guard_mutex is fine as is, we don't need to move de_thread out from under cred_guard_mutex. We just need to change de_thread to wait until all of the other threads are zombies. Which should remove about half your proposed patch. The other key thing is that knowing it isn't cred_guard_mutex lets us know that this kind of deadlock goes all of the way back to when CLONE_THREAD was merged into the kernel. Interestingly enough, looking at zap_other_threads and notify_count I have found a second bug. When a multi-threaded process becomes a zombie we don't send the notification to the parent process until the non-leader threads have been reaped. Which means ptrace can mess up sending SIGCHLD to the parent. Now arguably that might be what is desirable but I don't think so. If we aren't ptracing a thread then I don't think we want to delay sending SIGCHLD to the parent. So this whole area of the semantics of a ptrace'd multi-threaded process exiting/exec'ing looks like it needs a thorough going over. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock @ 2017-03-03 20:06 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-03-03 20:06 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api ebiederm@xmission.com (Eric W. Biederman) writes: > ebiederm@xmission.com (Eric W. Biederman) writes: > >> The big lesson for me, and what was not obvious from your change >> description is that we are changing the user space visible semantics >> of exec+ptrace and that cred_guard_mutex is not at all the problem (as >> we always take cred_guard_mutex in a killable or interruptible way). > > Just to follow up. > > Because the cred_guard_mutex is fine as is we don't need to move > de_thread out from under cred_guard_mutex. We just need to change > de_thread to wait until all of the other threads are zombies. > Which should remove about half your proposed patch. > > The other key thing is that knowning it isn't cred_guard_mutex let's us > know that this kind of deadlock goes all of the way back to when > CLONE_THREAD was merged into the kernel. > > Insteresingly enough looking at zap_other_threads and notify_count I > have found a second bug. When a multi-threaded processes becomes a > zombie we don't send the notification to the parent process until the > non-leader threads have been reaped. Which means ptrace can mess up > sending SIGCHLD to the parent. Bah. I was misreading the code. Nothing but exec uses notify_count and group_exit_task. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-03-03 20:11 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-03-03 20:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Ever since CLONE_THREAD support was added to the kernel it has been possible to dead-lock userspace by ptracing a process and not reaping its child threads. With use of the cred_guard_mutex in proc the ways userspace can unknowingly trigger a dead-lock have grown. Which makes a simple -f strace on a program that performs a multi-threaded exec unsafe. #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <pthread.h> void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); // or anything else which needs ->cred_guard_mutex, // say open(/proc/$pid/mem) ptrace(PTRACE_ATTACH, pid, 0,0); kill(pid, SIGCONT); return 0; } Solve this by modifying exec to only wait until all of the other threads are zombies, and not waiting until the other threads are reaped. As well as simplifying the userspace semantics this simplifies the code. Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Reported-by: Attila Fazekas <afazekas@redhat.com> Reported-by: Aleksa Sarai <asarai@suse.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Fixes: v2.4.0 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. 
* [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-03-03 20:11 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-03 20:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

Ever since CLONE_THREAD support was added to the kernel it has been possible to deadlock userspace by ptracing a process and not reaping its child threads. With the use of cred_guard_mutex in proc, the ways userspace can unknowingly trigger a deadlock have grown, which makes a simple "strace -f" of a program that performs a multi-threaded exec unsafe.

	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <pthread.h>

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}
		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0, 0);
		kill(pid, SIGCONT);
		return 0;
	}

Solve this by modifying exec to wait only until all of the other threads are zombies, not until the other threads are reaped. As well as simplifying the userspace semantics, this simplifies the code.

Reported-by: Ulrich Obergfell <uobergfe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reported-by: Attila Fazekas <afazekas-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reported-by: Aleksa Sarai <asarai-IBi9RG/b67k@public.gmane.org>
Reported-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Fixes: v2.4.0
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W.
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/exec.c | 31 +++++-------------------------- include/linux/sched.h | 5 ++--- kernel/exit.c | 18 +++++------------- kernel/signal.c | 6 +----- 4 files changed, 13 insertions(+), 47 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 698a86094f76..f78205a72304 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk) } sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); - if (!thread_group_leader(tsk)) - sig->notify_count--; - - while (sig->notify_count) { + zap_other_threads(tsk); + while (atomic_read(&sig->live) > 1) { __set_current_state(TASK_KILLABLE); spin_unlock_irq(lock); schedule(); @@ -1081,29 +1078,13 @@ static int de_thread(struct task_struct *tsk) /* * At this point all other threads have exited, all we have to - * do is to wait for the thread group leader to become inactive, - * and to assume its PID: + * do is to assume the PID of the thread group leader. 
*/ if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } + threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a @@ -1163,7 +1144,6 @@ static int de_thread(struct task_struct *tsk) } sig->group_exit_task = NULL; - sig->notify_count = 0; no_thread_group: /* we have changed execution domain */ @@ -1204,7 +1184,6 @@ static int de_thread(struct task_struct *tsk) /* protects against exit_notify() and __exit_signal() */ read_lock(&tasklist_lock); sig->group_exit_task = NULL; - sig->notify_count = 0; read_unlock(&tasklist_lock); return -EAGAIN; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 6261bfc12853..ff6aa76beac5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -711,11 +711,10 @@ struct signal_struct { /* thread group exit support */ int group_exit_code; /* overloaded: - * - notify group_exit_task when ->count is equal to notify_count + * - notify group_exit_task when ->live is equal to 1 * - everyone except group_exit_task is stopped during signal delivery * of fatal signals, group_exit_task processes the signal. 
*/ - int notify_count; struct task_struct *group_exit_task; /* thread group stop support, overloads group_exit_code too */ @@ -2763,7 +2762,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int); extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent); extern void force_sig(int, struct task_struct *); extern int send_sig(int, struct task_struct *, int); -extern int zap_other_threads(struct task_struct *p); +extern void zap_other_threads(struct task_struct *p); extern struct sigqueue *sigqueue_alloc(void); extern void sigqueue_free(struct sigqueue *); extern int send_sigqueue(struct sigqueue *, struct task_struct *, int group); diff --git a/kernel/exit.c b/kernel/exit.c index 5cfbd595f918..dc9bc2bdb45a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -111,13 +111,6 @@ static void __exit_signal(struct task_struct *tsk) tty = sig->tty; sig->tty = NULL; } else { - /* - * If there is any task waiting for the group exit - * then notify it: - */ - if (sig->notify_count > 0 && !--sig->notify_count) - wake_up_process(sig->group_exit_task); - if (tsk == sig->curr_target) sig->curr_target = next_thread(tsk); } @@ -701,10 +694,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead) tsk->exit_state = autoreap ? 
EXIT_DEAD : EXIT_ZOMBIE; if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) - wake_up_process(tsk->signal->group_exit_task); write_unlock_irq(&tasklist_lock); list_for_each_entry_safe(p, n, &dead, ptrace_entry) { @@ -740,7 +729,7 @@ static inline void check_stack_usage(void) {} void __noreturn do_exit(long code) { struct task_struct *tsk = current; - int group_dead; + int group_left, group_dead; TASKS_RCU(int tasks_rcu_i); profile_task_exit(tsk); @@ -809,7 +798,8 @@ void __noreturn do_exit(long code) if (tsk->mm) sync_mm_rss(tsk->mm); acct_update_integrals(tsk); - group_dead = atomic_dec_and_test(&tsk->signal->live); + group_left = atomic_dec_return(&tsk->signal->live); + group_dead = group_left == 0; if (group_dead) { #ifdef CONFIG_POSIX_TIMERS hrtimer_cancel(&tsk->signal->real_timer); @@ -818,6 +808,8 @@ void __noreturn do_exit(long code) if (tsk->mm) setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm); } + if ((group_left == 1) && tsk->signal->group_exit_task) + wake_up_process(tsk->signal->group_exit_task); acct_collect(code, group_dead); if (group_dead) tty_audit_exit(); diff --git a/kernel/signal.c b/kernel/signal.c index fbbab5a7c84d..17312c71f1ae 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1191,16 +1191,14 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t) /* * Nuke all other threads in the group. 
*/ -int zap_other_threads(struct task_struct *p) +void zap_other_threads(struct task_struct *p) { struct task_struct *t = p; - int count = 0; p->signal->group_stop_count = 0; while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; /* Don't bother with already dead threads */ if (t->exit_state) @@ -1208,8 +1206,6 @@ int zap_other_threads(struct task_struct *p) sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); } - - return count; } struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. 2017-03-03 20:11 ` Eric W. Biederman @ 2017-03-04 17:03 ` Oleg Nesterov 2017-03-30 8:07 ` Eric W. Biederman 1 reply; 93+ messages in thread
From: Oleg Nesterov @ 2017-03-04 17:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

On 03/03, Eric W. Biederman wrote:
>
> Ever since CLONE_THREAD support was added to the kernel it has been
> possible to deadlock userspace by ptracing a process and not reaping
> its child threads.

Hmm. I disagree... I do not think this is a bug. But let's discuss this separately; perhaps I misunderstood you.

> With the use of cred_guard_mutex in proc, the ways
> userspace can unknowingly trigger a deadlock have grown.

I think this particular problem did not exist until cred_guard_mutex was introduced. A debugger can obviously "delay" exec if it doesn't reap a zombie sub-thread, but that is another thing and not a bug imo.

> Solve this by modifying exec to wait only until all of the other
> threads are zombies, not until the other threads are reaped.

This patch looks wrong in many ways.

> @@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk)
> 	}
>
> 	sig->group_exit_task = tsk;
> -	sig->notify_count = zap_other_threads(tsk);
> -	if (!thread_group_leader(tsk))
> -		sig->notify_count--;
> -
> -	while (sig->notify_count) {
> +	zap_other_threads(tsk);
> +	while (atomic_read(&sig->live) > 1) {
> 		__set_current_state(TASK_KILLABLE);
> 		spin_unlock_irq(lock);
> 		schedule();

Very nice. So de_thread() returns as soon as all other threads decrement signal->live in do_exit(), before they do, say, exit_mm(). This is already wrong; for example, this breaks OOM. Plus a lot more problems afaics, but let's ignore this.

Note that de_thread() also unshares ->sighand before return. So in the case of mt exec it will likely see oldsighand->count != 1, allocate a new sighand_struct, and this breaks the locking.

Because the execing thread will use newsighand->siglock to protect its signal_struct while the zombie threads will use oldsighand->siglock to protect the same signal struct. Yes, tasklist_lock plus the fact that irq_disable implies rcu_lock mostly saves us, but not entirely: say, a foreign process doing __send_signal() can take the right or the wrong lock depending on /dev/random.

> @@ -818,6 +808,8 @@ void __noreturn do_exit(long code)
> 		if (tsk->mm)
> 			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
> 	}
> +	if ((group_left == 1) && tsk->signal->group_exit_task)
> +		wake_up_process(tsk->signal->group_exit_task);

This is racy, but this is minor.

Oleg.

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. 2017-03-04 17:03 ` Oleg Nesterov @ 2017-03-30 8:07 ` Eric W. Biederman 2017-04-01 5:11 ` Eric W. Biederman 2017-04-02 16:15 ` Oleg Nesterov 0 siblings, 2 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-03-30 8:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 03/03, Eric W. Biederman wrote:
>> @@ -1065,11 +1065,8 @@ static int de_thread(struct task_struct *tsk)
>> 	}
>>
>> 	sig->group_exit_task = tsk;
>> -	sig->notify_count = zap_other_threads(tsk);
>> -	if (!thread_group_leader(tsk))
>> -		sig->notify_count--;
>> -
>> -	while (sig->notify_count) {
>> +	zap_other_threads(tsk);
>> +	while (atomic_read(&sig->live) > 1) {
>> 		__set_current_state(TASK_KILLABLE);
>> 		spin_unlock_irq(lock);
>> 		schedule();
>
> Very nice. So de_thread() returns as soon as all other threads decrement
> signal->live in do_exit(), before they do, say, exit_mm(). This is already
> wrong; for example, this breaks OOM. Plus a lot more problems afaics, but
> let's ignore this.

Which means that we need to keep sig->notify_count.

> Note that de_thread() also unshares ->sighand before return. So in the
> case of mt exec it will likely see oldsighand->count != 1, allocate a
> new sighand_struct, and this breaks the locking.
>
> Because the execing thread will use newsighand->siglock to protect its
> signal_struct while the zombie threads will use oldsighand->siglock to
> protect the same signal struct. Yes, tasklist_lock plus the fact that
> irq_disable implies rcu_lock mostly saves us, but not entirely: say, a
> foreign process doing __send_signal() can take the right or the wrong
> lock depending on /dev/random.

Which leads to the question: how can we get back to the 2.4 behavior of freeing sighand_struct in do_exit?
At which point, as soon as we free sighand_struct, the last dying thread can notify de_thread and everything works.

There are only two uses of sighand->siglock that I can see in release_task: __ptrace_unlink and __exit_signal. For what __ptrace_unlink is doing, we should just be able to skip acquiring siglock if PF_EXITING is set. __exit_signal is a little more interesting, but half of what it is doing looks like it was pulled out of do_exit and just needs to be put back.

That probably adds up to 4 or 5 small, carefully written patches to sort out that part of the exit path, but I think it leads to a very simple and straightforward result, with the only side effect that we can exec a process and still have a debugger reaping zombie threads.

Oleg, does that sound reasonable to you?

Eric

^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH 0/2] exec: Fixing ptrace'd multi-threaded hang @ 2017-04-01 5:11 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01 5:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

I spent a little more time with this, and only waiting until the killed threads are zombies (and not reaped, as we do today) really looks like the right fix.

Oleg, the following two patches work on top of your PTRACE_EVENT_EXIT change and probably need a little more cleanup until they are ready for serious posting. That said, I want to post the code so I have a chance to get some feedback before I prepare the final round of patches.

These patches only handle the case when sighand_struct is not shared between different multi-threaded processes. The general case is solvable, but that is quite a bit more code.

Eric W. Biederman (2):
  sighand: Count each thread group once in sighand_struct
  exec: If possible don't wait for ptraced threads to be reaped

 fs/exec.c | 15 ++++++++++-----
 include/linux/sched/signal.h | 2 +-
 kernel/exit.c | 15 ++++++++++-----
 kernel/fork.c | 6 ++++--
 kernel/signal.c | 8 ++++++--
 5 files changed, 31 insertions(+), 15 deletions(-)

Eric

^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH 1/2] sighand: Count each thread group once in sighand_struct @ 2017-04-01 5:14 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01 5:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

In practice a thread group either is using a sighand_struct or it isn't. Therefore simplify things a bit: only increment the count in sighand_struct when a new thread group is created that uses the existing sighand_struct, and only decrement the count in sighand_struct when a thread group exits.

As well as standing on its own merits, this has the potential to simplify de_thread.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c | 2 +-
 kernel/fork.c | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index e126ebf2400c..8c5b3e106298 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -163,9 +163,9 @@ static void __exit_signal(struct task_struct *tsk)
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
-	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
+		__cleanup_sighand(sighand);
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..fe6f1bf32bb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1295,7 +1295,8 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 	struct sighand_struct *sig;
 
 	if (clone_flags & CLONE_SIGHAND) {
-		atomic_inc(&current->sighand->count);
+		if (!(clone_flags & CLONE_THREAD))
+			atomic_inc(&current->sighand->count);
 		return 0;
 	}
 	sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
@@ -1896,7 +1897,8 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
 bad_fork_cleanup_sighand:
-	__cleanup_sighand(p->sighand);
+	if (!(clone_flags & CLONE_THREAD))
+		__cleanup_sighand(p->sighand);
 bad_fork_cleanup_fs:
 	exit_fs(p); /* blocking */
 bad_fork_cleanup_files:
-- 
2.10.1

^ permalink raw reply related [flat|nested] 93+ messages in thread
* [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped @ 2017-04-01 5:16 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread
From: Eric W. Biederman @ 2017-04-01 5:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

Take advantage of the situation when sighand->count == 1 to wait in de_thread only for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD. Only old Linux threading libraries use CLONE_SIGHAND without CLONE_THREAD, so this situation should be present most of the time.

This allows ptracing through a multi-threaded exec without the danger of stalling the exec, as historically exec waits in de_thread for the other threads to be reaped before completing.

That wait is necessary because it is not safe to unshare the sighand_struct until all of the other threads in the thread group are reaped, since siglock, the lock that serializes the threads of a thread group, lives in sighand_struct. When oldsighand->count == 1 we know that there are no other users, and unsharing the sighand_struct in exec is pointless. This makes it safe to wait only for threads to become zombies, as the siglock won't change during exec and release_task will use the same siglock for the old threads as for the new threads.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W.
Biederman" <ebiederm@xmission.com> --- fs/exec.c | 15 ++++++++++----- include/linux/sched/signal.h | 2 +- kernel/exit.c | 13 +++++++++---- kernel/signal.c | 8 ++++++-- 4 files changed, 26 insertions(+), 12 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 65145a3df065..0fd29342bbe4 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk) struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock; + bool may_hang; if (thread_group_empty(tsk)) goto no_thread_group; @@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk) return -EAGAIN; } + may_hang = atomic_read(&oldsighand->count) != 1; sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); - if (!thread_group_leader(tsk)) + sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1); + if (may_hang && !thread_group_leader(tsk)) sig->notify_count--; while (sig->notify_count) { @@ -1092,9 +1094,10 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - cgroup_threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); + cgroup_threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); + + for (;may_hang;) { /* * Do this under tasklist_lock to ensure that * exit_notify() can't miss ->group_exit_task @@ -1108,6 +1111,8 @@ static int de_thread(struct task_struct *tsk) schedule(); if (unlikely(__fatal_signal_pending(tsk))) goto killed; + cgroup_threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); } /* diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 2cf446704cd4..187a9e980d3a 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -298,7 +298,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int); extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent); 
extern void force_sig(int, struct task_struct *); extern int send_sig(int, struct task_struct *, int); -extern int zap_other_threads(struct task_struct *p); +extern int zap_other_threads(struct task_struct *p, int do_count); extern struct sigqueue *sigqueue_alloc(void); extern void sigqueue_free(struct sigqueue *); extern int send_sigqueue(struct sigqueue *, struct task_struct *, int group); diff --git a/kernel/exit.c b/kernel/exit.c index 8c5b3e106298..972df5ebf79f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -712,6 +712,8 @@ static void forget_original_parent(struct task_struct *father, */ static void exit_notify(struct task_struct *tsk, int group_dead) { + struct sighand_struct *sighand = tsk->sighand; + struct signal_struct *signal = tsk->signal; bool autoreap; struct task_struct *p, *n; LIST_HEAD(dead); @@ -739,9 +741,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) - wake_up_process(tsk->signal->group_exit_task); + spin_lock(&sighand->siglock); + /* mt-exec, de_thread is waiting for threads to exit */ + if (signal->notify_count < 0 && !++signal->notify_count) + wake_up_process(signal->group_exit_task); + + spin_unlock(&sighand->siglock); write_unlock_irq(&tasklist_lock); list_for_each_entry_safe(p, n, &dead, ptrace_entry) { @@ -975,7 +980,7 @@ do_group_exit(int exit_code) else { sig->group_exit_code = exit_code; sig->flags = SIGNAL_GROUP_EXIT; - zap_other_threads(current); + zap_other_threads(current, 0); } spin_unlock_irq(&sighand->siglock); } diff --git a/kernel/signal.c b/kernel/signal.c index 986ef55641ea..e3a5bc239345 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1196,7 +1196,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t) /* * Nuke all other threads in the group. 
*/ -int zap_other_threads(struct task_struct *p) +int zap_other_threads(struct task_struct *p, int do_count) { struct task_struct *t = p; int count = 0; @@ -1205,13 +1205,17 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; + if (do_count > 0) + count++; /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + + if (do_count < 0) + count--; } return count; -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
task_struct *parent); extern void force_sig(int, struct task_struct *); extern int send_sig(int, struct task_struct *, int); -extern int zap_other_threads(struct task_struct *p); +extern int zap_other_threads(struct task_struct *p, int do_count); extern struct sigqueue *sigqueue_alloc(void); extern void sigqueue_free(struct sigqueue *); extern int send_sigqueue(struct sigqueue *, struct task_struct *, int group); diff --git a/kernel/exit.c b/kernel/exit.c index 8c5b3e106298..972df5ebf79f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -712,6 +712,8 @@ static void forget_original_parent(struct task_struct *father, */ static void exit_notify(struct task_struct *tsk, int group_dead) { + struct sighand_struct *sighand = tsk->sighand; + struct signal_struct *signal = tsk->signal; bool autoreap; struct task_struct *p, *n; LIST_HEAD(dead); @@ -739,9 +741,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) - wake_up_process(tsk->signal->group_exit_task); + spin_lock(&sighand->siglock); + /* mt-exec, de_thread is waiting for threads to exit */ + if (signal->notify_count < 0 && !++signal->notify_count) + wake_up_process(signal->group_exit_task); + + spin_unlock(&sighand->siglock); write_unlock_irq(&tasklist_lock); list_for_each_entry_safe(p, n, &dead, ptrace_entry) { @@ -975,7 +980,7 @@ do_group_exit(int exit_code) else { sig->group_exit_code = exit_code; sig->flags = SIGNAL_GROUP_EXIT; - zap_other_threads(current); + zap_other_threads(current, 0); } spin_unlock_irq(&sighand->siglock); } diff --git a/kernel/signal.c b/kernel/signal.c index 986ef55641ea..e3a5bc239345 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1196,7 +1196,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t) /* * Nuke all other threads in the group. 
*/ -int zap_other_threads(struct task_struct *p) +int zap_other_threads(struct task_struct *p, int do_count) { struct task_struct *t = p; int count = 0; @@ -1205,13 +1205,17 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; + if (do_count > 0) + count++; /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + + if (do_count < 0) + count--; } return count; -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
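The changed exit_notify() wake condition above (`notify_count < 0 && !++signal->notify_count`) is the core of the handshake, so a sequential sketch may help (purely illustrative userspace model; none of these names are kernel API): de_thread() seeds notify_count with the negative live-thread count, each exiting thread increments it, and only the increment that lands on zero performs the wake.

```c
#include <assert.h>

/* Sequential model: returns how many wake-ups fire while
 * 'live_threads' threads pass through exit_notify(). */
static int handshake_wakeups(int live_threads)
{
    int notify_count = -live_threads;   /* zap_other_threads(tsk, -1) */
    int wakeups = 0;

    for (int t = 0; t < live_threads; t++) {
        /* exit_notify(), under siglock in the patch: */
        if (notify_count < 0 && !++notify_count)
            wakeups++;                  /* wake_up_process(group_exit_task) */
    }
    return wakeups;
}
```

Exactly one wake fires, from the last thread to reach exit_notify(), which is what lets de_thread() sleep until every live thread has passed that point.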
* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped @ 2017-04-02 15:35 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-02 15:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/01, Eric W. Biederman wrote: > > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk) > struct signal_struct *sig = tsk->signal; > struct sighand_struct *oldsighand = tsk->sighand; > spinlock_t *lock = &oldsighand->siglock; > + bool may_hang; > > if (thread_group_empty(tsk)) > goto no_thread_group; > @@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk) > return -EAGAIN; > } > > + may_hang = atomic_read(&oldsighand->count) != 1; > sig->group_exit_task = tsk; > - sig->notify_count = zap_other_threads(tsk); > - if (!thread_group_leader(tsk)) > + sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1); Eric, this is amazing. So with this patch exec does different things depending on whether sighand is shared with another CLONE_SIGHAND task or not. To me this doesn't look sane in any case. And of course you do realize that it doesn't solve the problem entirely? If I modify my test-case a little bit: int xxx(void *arg) { for (;;) pause(); } void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; char stack[16 * 1024]; clone(xxx, stack + 16*1024, CLONE_SIGHAND|CLONE_VM, NULL); pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp("echo", "echo", "passed", NULL); } sleep(1); // or anything else which needs ->cred_guard_mutex, // say open(/proc/$pid/mem) ptrace(PTRACE_ATTACH, pid, 0,0); kill(pid, SIGCONT); return 0; } it should deadlock the same way?
So what is the point of making the, well imo insane, patch if it doesn't solve the problem? And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or exit_notify() should set tsk->exit_state under siglock, otherwise zap() can return the wrong count. Finally. This patch creates a nice security hole. Let me modify my test-case again: void *thread(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return NULL; } int main(void) { int pid = fork(); if (!pid) { pthread_t pt; pthread_create(&pt, NULL, thread, NULL); pthread_join(pt, NULL); execlp(path-to-setuid-binary, args); } sleep(1); // Now we can send the signals to setuid app kill(pid+1, ANYSIGNAL); return 0; } I see another email from you with another proposal. I disagree, will reply soon. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped @ 2017-04-02 18:53 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 18:53 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > On 04/01, Eric W. Biederman wrote: >> >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -1052,6 +1052,7 @@ static int de_thread(struct task_struct *tsk) >> struct signal_struct *sig = tsk->signal; >> struct sighand_struct *oldsighand = tsk->sighand; >> spinlock_t *lock = &oldsighand->siglock; >> + bool may_hang; >> >> if (thread_group_empty(tsk)) >> goto no_thread_group; >> @@ -1069,9 +1070,10 @@ static int de_thread(struct task_struct *tsk) >> return -EAGAIN; >> } >> >> + may_hang = atomic_read(&oldsighand->count) != 1; >> sig->group_exit_task = tsk; >> - sig->notify_count = zap_other_threads(tsk); >> - if (!thread_group_leader(tsk)) >> + sig->notify_count = zap_other_threads(tsk, may_hang ? 1 : -1); > > Eric, this is amazing. So with this patch exec does different things depending > on whether sighand is shared with another CLONE_SIGHAND task or not. To me > this doesn't look sane in any case. It is a 99% solution that makes it possible to talk about and review letting the exec continue after the subthreads are killed but not reaped. Sigh, I should have made may_hang say: may_hang = (atomic_read(&oldsighand->count) != 1) && (sig->nr_threads > 1) Which covers all known ways userspace actually uses these clone flags. > And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or > exit_notify() should set tsk->exit_state under siglock, otherwise zap() can > return the wrong count. zap_other_threads(tsk, 0) only gets called in the case where we don't care about the return value. It does not get called from fs/exec.c > Finally.
This patch creates the nice security hole. Let me modify my test-case > again: > > void *thread(void *arg) > { > ptrace(PTRACE_TRACEME, 0,0,0); > return NULL; > } > > int main(void) > { > int pid = fork(); > > if (!pid) { > pthread_t pt; > pthread_create(&pt, NULL, thread, NULL); > pthread_join(pt, NULL); > execlp(path-to-setuid-binary, args); > } > > sleep(1); > > // Now we can send the signals to setiuid app > kill(pid+1, ANYSIGNAL); > > return 0; > } That is a substantive objection, and something that definitely needs to get fixed. Can you think of anything else? Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped @ 2017-04-03 18:12 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-03 18:12 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/02, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or > > exit_notify() should set tsk->exit_state under siglock, otherwise zap() can > > return the wrong count. > > zap_other_thread(tsk, 0) only gets called in the case where we don't > care about the return value. It does not get called from fs/exec.c I meant that may_hang == 0 implies zap_other_threads(do_count => -1) which should return the number of threads which didn't pass exit_notify(). The returned value can be wrong unless you change exit_notify() to set exit_state under siglock. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped 2017-04-03 18:12 ` Oleg Nesterov (?) @ 2017-04-03 21:04 ` Eric W. Biederman 2017-04-05 16:44 ` Oleg Nesterov -1 siblings, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-04-03 21:04 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > On 04/02, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > And btw zap_other_threads(may_hang == 0) is racy. Either you need tasklist or >> > exit_notify() should set tsk->exit_state under siglock, otherwise zap() can >> > return the wrong count. >> >> zap_other_thread(tsk, 0) only gets called in the case where we don't >> care about the return value. It does not get called from fs/exec.c > > I meant that may_hang == 0 implies zap_other_threads(do_count => -1) which should > return the number of threads which didn't pass exit_notify(). The returned value > can be wrong unless you change exit_notify() to set exit_state under > siglock. Interesting an existing bug. I won't deny that one. Subtle to catch but easy enough to fix. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH 2/2] exec: If possible don't wait for ptraced threads to be reaped 2017-04-03 21:04 ` Eric W. Biederman @ 2017-04-05 16:44 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:44 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/03, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > I meant that may_hang == 0 implies zap_other_threads(do_count => -1) which should > > return the number of threads which didn't pass exit_notify(). The returned value > > can be wrong unless you change exit_notify() to set exit_state under > > siglock. but I forgot to add that, of course, this problem is very minor because we can only miss a thread which is already at the end of exit_notify() so nothing bad can happen. But imo should be fixed anyway, simply because this looks wrong/racy. Your recent 4/5 has the same problem. > Interesting an existing bug. Hmm... what do you mean? The current code looks fine. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
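The window Oleg is pointing at can be replayed as a purely sequential sketch (illustrative userspace model only; the arrays and steps are not kernel code): a thread that has already published exit_state but has not yet done its notify_count increment is skipped by the count, so the increments that follow overshoot zero and the wake condition fires before the last thread is really gone.

```c
#include <assert.h>

/* Two sub-threads; thread 1 is caught between "set exit_state"
 * and "++notify_count" at the moment the exec thread counts. */
static int race_overshoot(void)
{
    int exit_state[2] = {0, 0};
    int notify_count = 0;

    exit_state[1] = 1;          /* thread 1: exit_state published first */

    for (int i = 0; i < 2; i++) /* exec thread: zap(..., do_count = -1) */
        if (!exit_state[i])
            notify_count--;     /* counts only thread 0: now -1 */

    notify_count++;             /* thread 1's pending increment lands on 0,
                                   so the wake fires while thread 0 lives */
    notify_count++;             /* thread 0 exits: overshoots to +1 */
    return notify_count;
}
```

The model ends at +1 instead of 0, matching the point that the count can only be off by threads already at the very end of exit_notify(): minor, but still worth closing by setting exit_state under siglock.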
* Re: [RFC][PATCH 0/2] exec: Fixing ptrace'd multi-threaded hang @ 2017-04-02 15:38 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-02 15:38 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/01, Eric W. Biederman wrote: > > These patches only handle the case when sighand_struct is not > shared between different multi-threaded processes. Ah, I didn't read 0/2 when I looked at 2/2, so at least you documented this. > The general > case is solvable but that is quite a bit more code. I don't think so, please see my reply to 2/2. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 0/5] exec: Fixing ptrace'd multi-threaded hang @ 2017-04-02 22:50 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg, your comment about kill being able to send a signal was an important dimension I had missed, thank you. This patchset just denies the case of a sighand_struct shared between different multi-threaded processes, as I don't think anyone cares. I can fix that if anyone cares but I am not certain we actually do. I have reworked the ptrace notification code so that we always send notifications if we can but don't wait if it is a coredump or an exec, which simplifies the code nicely. A few more tweaks are needed before a final version but I think things are compelling. fs/exec.c | 23 ++------- include/linux/sched/signal.h | 1 + kernel/exit.c | 20 ++++---- kernel/fork.c | 14 +++++- kernel/ptrace.c | 4 ++ kernel/signal.c | 112 +++++++++++++++++++------------------------ 6 files changed, 78 insertions(+), 96 deletions(-) Eric W. Biederman (5): ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump sighand: Count each thread group once in sighand_struct clone: Disallow CLONE_THREAD with a shared sighand_struct exec: If possible don't wait for ptraced threads to be reaped signal: Don't allow accessing signal_struct by old threads after exec Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump @ 2017-04-02 22:51 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Take advantage of the fact that no ptrace stop will wait for a debugger if another process sends SIGKILL to the waiting task. In the case of exec and coredump, which have many interesting deadlock opportunities and where no one tests what happens if you are ptraced, act as if another SIGKILL was immediately sent to the process and don't stop. Keep sending the signal to the tracer so that this appears like the worst case where someone else sent the process a SIGKILL before the tracer could react. So all non-buggy tracers must support this case. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- kernel/signal.c | 103 +++++++++++++++++++++++--------------------------------- 1 file changed, 42 insertions(+), 61 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 7e59ebc2c25e..11fa736eb2ae 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1740,30 +1740,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, spin_unlock_irqrestore(&sighand->siglock, flags); } -static inline int may_ptrace_stop(void) -{ - if (!likely(current->ptrace)) - return 0; - /* - * Are we in the middle of do_coredump? - * If so and our tracer is also part of the coredump stopping - * is a deadlock situation, and pointless because our tracer - * is dead so don't allow us to stop. - * If SIGKILL was already sent before the caller unlocked - * ->siglock we must see ->core_state != NULL. Otherwise it - * is safe to enter schedule(). - * - * This is almost outdated, a task with the pending SIGKILL can't - * block in TASK_TRACED.
But PTRACE_EVENT_EXIT can be reported - * after SIGKILL was already dequeued. - */ - if (unlikely(current->mm->core_state) && - unlikely(current->mm == current->parent->mm)) - return 0; - - return 1; -} - /* * Return non-zero if there is a SIGKILL that should be waking us up. * Called with the siglock held. @@ -1789,6 +1765,8 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info) __releases(¤t->sighand->siglock) __acquires(¤t->sighand->siglock) { + bool ptrace_parent; + bool ptrace_wait; bool gstop_done = false; if (arch_ptrace_stop_needed(exit_code, info)) { @@ -1842,53 +1820,56 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info) spin_unlock_irq(¤t->sighand->siglock); read_lock(&tasklist_lock); - if (may_ptrace_stop()) { - /* - * Notify parents of the stop. - * - * While ptraced, there are two parents - the ptracer and - * the real_parent of the group_leader. The ptracer should - * know about every stop while the real parent is only - * interested in the completion of group stop. The states - * for the two don't interact with each other. Notify - * separately unless they're gonna be duplicates. - */ + ptrace_parent = likely(current->ptrace); + /* + * Notify parents of the stop. + * + * While ptraced, there are two parents - the ptracer and + * the real_parent of the group_leader. The ptracer should + * know about every stop while the real parent is only + * interested in the completion of group stop. The states + * for the two don't interact with each other. Notify + * separately unless they're gonna be duplicates. + */ + if (ptrace_parent) do_notify_parent_cldstop(current, true, why); - if (gstop_done && ptrace_reparented(current)) - do_notify_parent_cldstop(current, false, why); + if (gstop_done && (ptrace_reparented(current) || !ptrace_parent)) + do_notify_parent_cldstop(current, false, why); - /* - * Don't want to allow preemption here, because - * sys_ptrace() needs this task to be inactive. 
- * - * XXX: implement read_unlock_no_resched(). - */ - preempt_disable(); - read_unlock(&tasklist_lock); - preempt_enable_no_resched(); - freezable_schedule(); - } else { - /* - * By the time we got the lock, our tracer went away. - * Don't drop the lock yet, another tracer may come. - * - * If @gstop_done, the ptracer went away between group stop - * completion and here. During detach, it would have set - * JOBCTL_STOP_PENDING on us and we'll re-enter - * TASK_STOPPED in do_signal_stop() on return, so notifying - * the real parent of the group stop completion is enough. - */ - if (gstop_done) - do_notify_parent_cldstop(current, false, why); + /* + * Always wait for the tracer if the process is traced unless + * the process is in the middle of an exec or core dump. + * + * In exec we risk a deadlock with de_thread waiting for the + * thread to be killed, the tracer, and acquiring cred_guard_mutex. + * + * In coredump we risk a deadlock if the tracer is also part + * of the coredump stopping. + */ + ptrace_wait = ptrace_parent && + !current->signal->group_exit_task && + !current->mm->core_state; + if (!ptrace_wait) { /* tasklist protects us from ptrace_freeze_traced() */ __set_current_state(TASK_RUNNING); if (clear_code) current->exit_code = 0; - read_unlock(&tasklist_lock); } /* + * Don't want to allow preemption here, because + * sys_ptrace() needs this task to be inactive. + * + * XXX: implement read_unlock_no_resched(). + */ + preempt_disable(); + read_unlock(&tasklist_lock); + preempt_enable_no_resched(); + if (ptrace_wait) + freezable_schedule(); + + /* * We are back. Now reacquire the siglock before touching * last_siginfo, so that we are sure to have synchronized with * any signal-sending on another CPU that wants to examine it. -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
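The behavioral core of the patch is the new ptrace_wait predicate, which can be summarized as a tiny truth-table model (illustrative only; the three flags stand in for current->ptrace, signal->group_exit_task and mm->core_state):

```c
#include <assert.h>

/* Sketch: stop and wait for the tracer only when the task is traced
 * and neither an exec (group_exit_task set) nor a coredump
 * (core_state set) is in flight. */
static int ptrace_should_wait(int traced, int in_exec, int in_coredump)
{
    return traced && !in_exec && !in_coredump;
}
```

So a traced task still notifies its tracer in every case, but only blocks in freezable_schedule() when no exec or coredump could end up waiting on it.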
* [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump @ 2017-04-02 22:51 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA Take advantage of the fact that no ptrace stop will wait for a debugger if another process sends SIGKILL to the waiting task. In the case of exec and coredump which have many interesting deadlock opportunities and no one tests the what happens if you are ptraced during exec or coredump act like another SIGKILL was immediately sent to the process and don't stop. Keep sending the signal to the tracer so that this appears like the worst case where someone else sent the process a SIGKILL before the tracer could react. So all non-buggy tracers must support this case. Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- kernel/signal.c | 103 +++++++++++++++++++++++--------------------------------- 1 file changed, 42 insertions(+), 61 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 7e59ebc2c25e..11fa736eb2ae 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1740,30 +1740,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, spin_unlock_irqrestore(&sighand->siglock, flags); } -static inline int may_ptrace_stop(void) -{ - if (!likely(current->ptrace)) - return 0; - /* - * Are we in the middle of do_coredump? - * If so and our tracer is also part of the coredump stopping - * is a deadlock situation, and pointless because our tracer - * is dead so don't allow us to stop. - * If SIGKILL was already sent before the caller unlocked - * ->siglock we must see ->core_state != NULL. Otherwise it - * is safe to enter schedule(). 
- * - * This is almost outdated, a task with the pending SIGKILL can't - * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported - * after SIGKILL was already dequeued. - */ - if (unlikely(current->mm->core_state) && - unlikely(current->mm == current->parent->mm)) - return 0; - - return 1; -} - /* * Return non-zero if there is a SIGKILL that should be waking us up. * Called with the siglock held. @@ -1789,6 +1765,8 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info) __releases(&current->sighand->siglock) __acquires(&current->sighand->siglock) { + bool ptrace_parent; + bool ptrace_wait; bool gstop_done = false; if (arch_ptrace_stop_needed(exit_code, info)) { @@ -1842,53 +1820,56 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info) spin_unlock_irq(&current->sighand->siglock); read_lock(&tasklist_lock); - if (may_ptrace_stop()) { - /* - * Notify parents of the stop. - * - * While ptraced, there are two parents - the ptracer and - * the real_parent of the group_leader. The ptracer should - * know about every stop while the real parent is only - * interested in the completion of group stop. The states - * for the two don't interact with each other. Notify - * separately unless they're gonna be duplicates. - */ + ptrace_parent = likely(current->ptrace); + /* + * Notify parents of the stop. + * + * While ptraced, there are two parents - the ptracer and + * the real_parent of the group_leader. The ptracer should + * know about every stop while the real parent is only + * interested in the completion of group stop. The states + * for the two don't interact with each other. Notify + * separately unless they're gonna be duplicates. 
+ */ + if (ptrace_parent) do_notify_parent_cldstop(current, true, why); - if (gstop_done && ptrace_reparented(current)) - do_notify_parent_cldstop(current, false, why); + if (gstop_done && (ptrace_reparented(current) || !ptrace_parent)) + do_notify_parent_cldstop(current, false, why); - /* - * Don't want to allow preemption here, because - * sys_ptrace() needs this task to be inactive. - * - * XXX: implement read_unlock_no_resched(). - */ - preempt_disable(); - read_unlock(&tasklist_lock); - preempt_enable_no_resched(); - freezable_schedule(); - } else { - /* - * By the time we got the lock, our tracer went away. - * Don't drop the lock yet, another tracer may come. - * - * If @gstop_done, the ptracer went away between group stop - * completion and here. During detach, it would have set - * JOBCTL_STOP_PENDING on us and we'll re-enter - * TASK_STOPPED in do_signal_stop() on return, so notifying - * the real parent of the group stop completion is enough. - */ - if (gstop_done) - do_notify_parent_cldstop(current, false, why); + /* + * Always wait for the tracer if the process is traced unless + * the process is in the middle of an exec or core dump. + * + * In exec we risk a deadlock with de_thread waiting for the + * thread to be killed, the tracer, and acquiring cred_guard_mutex. + * + * In coredump we risk a deadlock if the tracer is also part + * of the coredump stopping. + */ + ptrace_wait = ptrace_parent && + !current->signal->group_exit_task && + !current->mm->core_state; + if (!ptrace_wait) { /* tasklist protects us from ptrace_freeze_traced() */ __set_current_state(TASK_RUNNING); if (clear_code) current->exit_code = 0; - read_unlock(&tasklist_lock); } /* + * Don't want to allow preemption here, because + * sys_ptrace() needs this task to be inactive. + * + * XXX: implement read_unlock_no_resched(). + */ + preempt_disable(); + read_unlock(&tasklist_lock); + preempt_enable_no_resched(); + if (ptrace_wait) + freezable_schedule(); + + /* * We are back. 
Now reacquire the siglock before touching * last_siginfo, so that we are sure to have synchronized with * any signal-sending on another CPU that wants to examine it. -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 1/5] ptrace: Don't wait in PTRACE_O_TRACEEXIT for exec or coredump 2017-04-02 22:51 ` Eric W. Biederman (?) @ 2017-04-05 16:19 ` Oleg Nesterov -1 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/02, Eric W. Biederman wrote: > > In the case of exec and coredump which have many interesting deadlock > opportunities So this patch is very close to my 2/2 one-liner, except - you removed the current->mm == current->parent->mm check. I didn't do this on purpose, because even the ->core_state check is not really needed if we check ->group_exit_task; this needs more changes anyway, but I won't argue. - With your patch we send the notification to the debugger even if we are not going to stop. This is not wrong, but why? This is pointless, nobody relies on SIGCHLD; if nothing else it doesn't queue. Again, I won't argue, but this complicates both the patch and the code for no reason. Unless I missed something. > Keep sending the signal to the tracer so that this appears like > the worst case where someone else sent the process a SIGKILL before > the tracer could react. So all non-buggy tracers must support > this case. Well, I can't understand the changelog. Sure, a debugger must support this case, but obviously this can break things anyway. For example: the coredumping thread must stop in PTRACE_EVENT_EXIT. There is a tool (I don't remember its name) which does ptrace_attach(PTRACE_SEIZE, PTRACE_O_TRACEEXIT) after the coredump was already started, closes the pipe, and reads the registers when this thread actually exits. This patch or my 2/2 should not break it, ->group_exit_task will be cleared after do_coredump(), but unfortunately something else can be broken. 
So I think the changelog should mention that yes, this is a user-visible change which _can_ break something anyway. In short, I will be really happy if this patch comes from you, not me ;) Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 2/5] sighand: Count each thread group once in sighand_struct @ 2017-04-02 22:51 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api In practice a thread group is either using a sighand_struct or it isn't. Therefore simplify things a bit and only increment the count in sighand_struct when a new thread group is created that uses the existing sighand_struct, and only decrement the count in sighand_struct when a thread group exits. As well as standing on its own merits this has the potential to simplify de_thread. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/exit.c | 2 +- kernel/fork.c | 6 ++++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index e126ebf2400c..8c5b3e106298 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -163,9 +163,9 @@ static void __exit_signal(struct task_struct *tsk) tsk->sighand = NULL; spin_unlock(&sighand->siglock); - __cleanup_sighand(sighand); clear_tsk_thread_flag(tsk, TIF_SIGPENDING); if (group_dead) { + __cleanup_sighand(sighand); flush_sigqueue(&sig->shared_pending); tty_kref_put(tty); } diff --git a/kernel/fork.c b/kernel/fork.c index 6c463c80e93d..fe6f1bf32bb9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1295,7 +1295,8 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) struct sighand_struct *sig; if (clone_flags & CLONE_SIGHAND) { - atomic_inc(&current->sighand->count); + if (!(clone_flags & CLONE_THREAD)) + atomic_inc(&current->sighand->count); return 0; } sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL); @@ -1896,7 +1897,8 @@ static __latent_entropy struct task_struct *copy_process( if (!(clone_flags & CLONE_THREAD)) free_signal_struct(p->signal); bad_fork_cleanup_sighand: - 
__cleanup_sighand(p->sighand); + if (!(clone_flags & CLONE_THREAD)) + __cleanup_sighand(p->sighand); bad_fork_cleanup_fs: exit_fs(p); /* blocking */ bad_fork_cleanup_files: -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct @ 2017-04-02 22:52 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:52 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Old threading libraries used CLONE_SIGHAND without CLONE_THREAD. Modern threading libraries always use CLONE_SIGHAND | CLONE_THREAD. Therefore let's simplify our lives and stop supporting a case no one cares about. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/fork.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/fork.c b/kernel/fork.c index fe6f1bf32bb9..0632ac1180be 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process( if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) return ERR_PTR(-EINVAL); + /* Disallow CLONE_THREAD with a shared SIGHAND structure. No + * one cares and supporting it leads to unnecessarily complex + * code. + */ + if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1)) + return ERR_PTR(-EINVAL); + /* * Shared signal handlers imply shared VM. By way of the above, * thread groups also imply shared VM. Blocking this case allows -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct @ 2017-04-05 16:24 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/02, Eric W. Biederman wrote: > > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process( > if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) > return ERR_PTR(-EINVAL); > > + /* Disallow CLONE_THREAD with a shared SIGHAND structure. No > + * one cares Well, can't resist... I won't argue, but we can't know if no one cares or not. I agree that most probably this won't break anything, but who knows... I am always scared when we add incompatible changes. > and supporting it leads to unnecessarily complex > + * code. > + */ > + if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1)) > + return ERR_PTR(-EINVAL); Perhaps the comment should explain why we do this and say that sighand-unsharing in de_thread() depends on this. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct 2017-04-05 16:24 ` Oleg Nesterov (?) @ 2017-04-05 17:34 ` Eric W. Biederman 2017-04-05 18:11 ` Oleg Nesterov -1 siblings, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-04-05 17:34 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > On 04/02, Eric W. Biederman wrote: >> >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -1515,6 +1515,13 @@ static __latent_entropy struct task_struct *copy_process( >> if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) >> return ERR_PTR(-EINVAL); >> >> + /* Disallow CLONE_THREAD with a shared SIGHAND structure. No >> + * one cares > > Well, can't resist... I won't argue, but we can't know if no one cares > or not. I agree that most probably this won't break anything, but who > knows... I am always scared when we add incompatible changes. I agree that changing userspace semantics is something to be very careful with. But at least for purposes of discussion I think this is a good patch. I can avoid this change but it requires moving sighand->siglock into signal_struct and introducing a new spinlock into sighand_struct to just guard the signal handlers. However I think the change to move siglock would be a distraction from the larger issues of this patchset. Once we address the core issues I will be happy to revisit this. >> and supporting it leads to unnecessarily complex >> + * code. >> + */ >> + if ((clone_flags & CLONE_THREAD) && (atomic_read(&current->sighand->count) > 1)) >> + return ERR_PTR(-EINVAL); > > Perhaps the comment should explain why we do this and say that > sighand-unsharing in de_thread() depends on this. That would be a better comment. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 3/5] clone: Disallow CLONE_THREAD with a shared sighand_struct 2017-04-05 17:34 ` Eric W. Biederman @ 2017-04-05 18:11 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 18:11 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/05, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > I agree that changing userspace semantics is something to be very > careful with. But at least for purposes of discussion I think this is a > good patch. I agree that we need it with your approach, but imo it would be much better to not depend on subtle changes like this. My 2/2 or your 1/5 are already bad enough. > I can avoid this change but it requires moving sighand->siglock > into signal_struct and introducing a new spinlock into sighand_struct > to just guard the signal handlers. Oh, this looks much, much worse to me. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 4/5] exec: If possible don't wait for ptraced threads to be reaped @ 2017-04-02 22:53 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:53 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Take advantage of the situation when sighand->count == 1 to only wait for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD in de_thread. Only old Linux threading libraries use CLONE_SIGHAND without CLONE_THREAD, so this situation should be present most of the time. This allows ptracing through a multi-threaded exec without the danger of stalling the exec, as historically exec waits for the other threads to be reaped in de_thread before completing. This wait is necessary because it is not safe to unshare the sighand_struct until all of the other threads in this thread group are reaped: the lock that serializes threads in a thread group, siglock, lives in sighand_struct. When oldsighand->count == 1 we know that there are no other users and unsharing the sighand struct in exec is pointless. This makes it safe to only wait for threads to become zombies, as the siglock won't change during exec and release_task will use the same siglock for the old threads as for the new threads. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> --- fs/exec.c | 22 ++-------------------- kernel/exit.c | 18 ++++++++---------- kernel/signal.c | 2 +- 3 files changed, 11 insertions(+), 31 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 65145a3df065..303a114b00ce 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1071,9 +1071,6 @@ static int de_thread(struct task_struct *tsk) sig->group_exit_task = tsk; sig->notify_count = zap_other_threads(tsk); - if (!thread_group_leader(tsk)) - sig->notify_count--; - while (sig->notify_count) { __set_current_state(TASK_KILLABLE); spin_unlock_irq(lock); @@ -1092,23 +1089,8 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - cgroup_threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - cgroup_threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } + cgroup_threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a diff --git a/kernel/exit.c b/kernel/exit.c index 8c5b3e106298..955c96e3fc12 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -118,13 +118,6 @@ static void __exit_signal(struct task_struct *tsk) tty = sig->tty; sig->tty = NULL; } else { - /* - * If there is any task waiting for the group exit - * then notify it: - */ - if (sig->notify_count > 0 && !--sig->notify_count) - wake_up_process(sig->group_exit_task); - if (tsk == sig->curr_target) sig->curr_target = next_thread(tsk); } @@ -712,6 +705,8 @@ static void forget_original_parent(struct task_struct *father, */ static void exit_notify(struct task_struct *tsk, int group_dead) { + struct sighand_struct *sighand = tsk->sighand; + 
struct signal_struct *signal = tsk->signal; bool autoreap; struct task_struct *p, *n; LIST_HEAD(dead); @@ -739,9 +734,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) - wake_up_process(tsk->signal->group_exit_task); + spin_lock(&sighand->siglock); + /* mt-exec, de_thread is waiting for threads to exit */ + if (signal->notify_count > 0 && !--signal->notify_count) + wake_up_process(signal->group_exit_task); + + spin_unlock(&sighand->siglock); write_unlock_irq(&tasklist_lock); list_for_each_entry_safe(p, n, &dead, ptrace_entry) { diff --git a/kernel/signal.c b/kernel/signal.c index 11fa736eb2ae..fd75ba33ee3d 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1205,13 +1205,13 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count; -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 4/5] exec: If possible don't wait for ptraced threads to be reaped 2017-04-02 22:53 ` Eric W. Biederman (?) @ 2017-04-05 16:15 ` Oleg Nesterov -1 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/02, Eric W. Biederman wrote: > > Take advantage of the situation when sighand->count == 1 to only wait > for threads to reach EXIT_ZOMBIE instead of EXIT_DEAD in de_thread. Let me comment on this patch first, it looks mostly fine to me. And note that this is what my patch does too: exec() waits until all threads pass exit_notify() and then drops cred_guard_mutex. However, with my patch exec() then waits until all threads disappear. This uglifies the code but it is simple and safe. With your patches exec doesn't do another wait and succeeds after the 1st wait. I think this is wrong and the next patch is not enough. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec @ 2017-04-02 22:57 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 22:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Add exec_id to signal_struct and compare it at a few choice moments. I believe this closes the security holes that letting the zombie threads linger after exec opens up. The problem is that old threads may have different creds after a setuid exec, and then formerly shared resources may change. So signal sending and accesses by proc need to be blocked. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- fs/exec.c | 1 + include/linux/sched/signal.h | 1 + kernel/fork.c | 1 + kernel/ptrace.c | 4 ++++ kernel/signal.c | 7 ++++++- 5 files changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/exec.c b/fs/exec.c index 303a114b00ce..730dee8bb2f8 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1323,6 +1323,7 @@ void setup_new_exec(struct linux_binprm * bprm) /* An exec changes our domain. 
We are no longer part of the thread group */ current->self_exec_id++; + current->signal->exec_id = current->self_exec_id; flush_signal_handlers(current, 0); } EXPORT_SYMBOL(setup_new_exec); diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 2cf446704cd4..63ae951ee330 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -80,6 +80,7 @@ struct signal_struct { atomic_t live; int nr_threads; struct list_head thread_head; + u32 exec_id; wait_queue_head_t wait_chldexit; /* for wait4() */ diff --git a/kernel/fork.c b/kernel/fork.c index 0632ac1180be..a442fa099842 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1387,6 +1387,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; + sig->exec_id = current->self_exec_id; mutex_init(&sig->cred_guard_mutex); diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 0af928712174..cc6b10b1ffbe 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -277,6 +277,10 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode) * or halting the specified task is impossible. */ + /* Don't allow inspecting a thread after exec */ + if (task->self_exec_id != task->signal->exec_id) + return 1; + /* Don't let security modules deny introspection */ if (same_thread_group(task, current)) return 0; diff --git a/kernel/signal.c b/kernel/signal.c index fd75ba33ee3d..fe8dcdb622f5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, from_ancestor_ns || (info == SEND_SIG_FORCED))) goto ret; + /* Don't allow thread group signals after exec */ + if (group && (t->signal->exec_id != t->self_exec_id)) + goto ret; + pending = group ? 
&t->signal->shared_pending : &t->pending; /* * Short-circuit ignored signals and support queuing @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, * must see ->sighand == NULL. */ spin_lock(&sighand->siglock); - if (likely(sighand == tsk->sighand)) { + if (likely((sighand == tsk->sighand) && + (tsk->self_exec_id == tsk->signal->exec_id))) { rcu_read_unlock(); break; } -- 2.10.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec @ 2017-04-05 16:18 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:18 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/02, Eric W. Biederman wrote: > > Add exec_id to signal_struct and compare it at a few choice moments. I really dislike this change no matter what, sorry. Firstly, task_struct->*_exec_id should simply die (I already have the patch), or at least they should be moved into signal_struct simply because this is per-process thing. > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, > from_ancestor_ns || (info == SEND_SIG_FORCED))) > goto ret; > > + /* Don't allow thread group signals after exec */ > + if (group && (t->signal->exec_id != t->self_exec_id)) > + goto ret; Hmm. Either we do not need this exec_id check at all, or we should not take "group" into account; a fatal signal (say SIGKILL) will kill the whole thread-group. > @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, > * must see ->sighand == NULL. > */ > spin_lock(&sighand->siglock); > - if (likely(sighand == tsk->sighand)) { > + if (likely((sighand == tsk->sighand) && > + (tsk->self_exec_id == tsk->signal->exec_id))) { Oh, this doesn't look good to me. Yes, with your approach we probably need this to, say, ensure that posix-cpu-timer can't kill the process after exec, but I'd rather add the exit_state check into run_posix_timers(). But OK, suppose that we fix the problems with signal-after-exec. ==================================================================== Now lets fix another problem. 
A mt exec succeeds and the application does sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC), which fails because it finds another (zombie) SECCOMP_MODE_FILTER thread. And after we fix this problem, what else will we need to fix? I really think that - whatever we do - there should be no other threads after exec, even zombies. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec @ 2017-04-05 18:16 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-05 18:16 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > On 04/02, Eric W. Biederman wrote: >> >> Add exec_id to signal_struct and compare it at a few choice moments. > > I really dislike this change no matter what, sorry. > > Firstly, task_struct->*_exec_id should simply die (I already have the > patch), or at least they should be moved into signal_struct simply > because this is per-process thing. I am quite happy to find a better way to implement this. More than anything this was my proof of concept that it is possible to close the security holes created if we allow our zombies to be normal zombies. >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, >> from_ancestor_ns || (info == SEND_SIG_FORCED))) >> goto ret; >> >> + /* Don't allow thread group signals after exec */ >> + if (group && (t->signal->exec_id != t->self_exec_id)) >> + goto ret; > > Hmm. Either we do not need this exec_id check at all, or we should not > take "group" into account; a fatal signal (say SIGKILL) will kill the > whole thread-group. Wow. Those are crazy semantics for fatal signals. Sending a tkill should not affect the entire thread group. Oleg I think this is a bug you introduced and likely requires a separate fix. I really don't understand the logic in: commit 5fcd835bf8c2cde06404559b1904e2f1dfcb4567 Author: Oleg Nesterov <oleg@tv-sign.ru> Date: Wed Apr 30 00:52:55 2008 -0700 signals: use __group_complete_signal() for the specific signals too Based on Pavel Emelyanov's suggestion. 
Rename __group_complete_signal() to complete_signal() and use it to process the specific signals too. To do this we simply add the "int group" argument. This allows us to greatly simply the signal-sending code and adds a useful behaviour change. We can avoid the unneeded wakeups for the private signals because wants_signal() is more clever than sigismember(blocked), but more importantly we now take into account the fatal specific signals too. The latter allows us to kill some subtle checks in handle_stop_signal() and makes the specific/group signal's behaviour more consistent. For example, currently sigtimedwait(FATAL_SIGNAL) behaves differently depending on was the signal sent by kill() or tkill() if the signal was not blocked. And. This allows us to tweak/fix the behaviour when the specific signal is sent to the dying/dead ->group_leader. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> >> @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, >> * must see ->sighand == NULL. >> */ >> spin_lock(&sighand->siglock); >> - if (likely(sighand == tsk->sighand)) { >> + if (likely((sighand == tsk->sighand) && >> + (tsk->self_exec_id == tsk->signal->exec_id))) { > > Oh, this doesn't look good to me. Yes, with your approach we probably need > this to, say, ensure that posix-cpu-timer can't kill the process after exec, > but I'd rather add the exit_state check into run_posix_timers(). The entire point of lock_task_sighand is to not operate on tasks/processes that have exited. The fact that sighand even appears in the name is deceptive, because it is all about siglock and nothing to do with sighand. > But OK, suppose that we fix the problems with signal-after-exec.
> > ==================================================================== > Now lets fix another problem. A mt exec suceeds and apllication does > sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC) which fails because it finds > another (zombie) SECCOMP_MODE_FILTER thread. > > And after we fix this problem, what else we will need to fix? > > > I really think that - whatever we do - there should be no other threads > after exec, even zombies. I see where you are coming from. I need to stare at this a bit longer. Because you are right. Reusing the signal_struct and leaving zombies around is very prone to bugs. So it is not very maintainable. I suspect the answer here is to simply allocate a new sighand_struct and a new signal_struct if there we are not single threaded by the time we get down to the end of de_thread. However even if it is a case of whack-a-mole semantically not-blocking-for-zombies looks like the right thing to do and we need to figure out how to do it maintainably. Eric ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH v2 5/5] signal: Don't allow accessing signal_struct by old threads after exec @ 2017-04-06 15:48 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-06 15:48 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api, Eugene Syromiatnikov On 04/05, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > >> --- a/kernel/signal.c > >> +++ b/kernel/signal.c > >> @@ -995,6 +995,10 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, > >> from_ancestor_ns || (info == SEND_SIG_FORCED))) > >> goto ret; > >> > >> + /* Don't allow thread group signals after exec */ > >> + if (group && (t->signal->exec_id != t->self_exec_id)) > >> + goto ret; > > > > Hmm. Either we do not need this exec_id check at all, or we should not > > take "group" into account; a fatal signal (say SIGKILL) will kill the > > whole thread-group. > > Wow. Those are crazy semantics for fatal signals. Sending a tkill > should not affect the entire thread group. How so? SIGKILL or any fatal signal should kill the whole process, even if it was sent by tkill(). > Oleg I think this is a bug > you introduced and likely requires a separate fix. > > I really don't understand the logic in: > > commit 5fcd835bf8c2cde06404559b1904e2f1dfcb4567 > Author: Oleg Nesterov <oleg@tv-sign.ru> > Date: Wed Apr 30 00:52:55 2008 -0700 > > signals: use __group_complete_signal() for the specific signals too No. You can even forget about "send" path for the moment. Just suppose that a thread dequeues SIGKILL sent by tkill(). In this case it will call do_group_exit() and kill the group anyway. It is not possible to kill an individual thread, and linux never did this. Afaics, this commit also fixes the case when SIGKILL can be lost when tkill() races with the exiting target. 
Or if the target is a zombie-leader. Exactly because they obviously can't dequeue SIGKILL. Plus we want to shutdown the whole thread-group "asap", that is why complete_signal() sets SIGNAL_GROUP_EXIT and sends SIGKILL to other threads in the "send" path. This btw reminds me that we want to do the same with sig_kernel_coredump() signals too, but this is not simple. > >> @@ -1247,7 +1251,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, > >> * must see ->sighand == NULL. > >> */ > >> spin_lock(&sighand->siglock); > >> - if (likely(sighand == tsk->sighand)) { > >> + if (likely((sighand == tsk->sighand) && > >> + (tsk->self_exec_id == tsk->signal->exec_id))) { > > > > Oh, this doesn't look good to me. Yes, with your approach we probably need > > this to, say, ensure that posix-cpu-timer can't kill the process after exec, > > but I'd rather add the exit_state check into run_posix_timers(). > > The entire point of lock_task_sighand is to not operate on > tasks/processes that have exited. Well, the entire point of lock_task_sighand() is take ->siglock if possible. > The fact it even sighand in there is > deceptive because it is all about siglock and nothing to do with > sighand. Not sure I understand what you mean... Yes, lock_task_sighand() can obviously fail, and yes the failure is used as an indication that this thread has gone. But a zombie thread controlled by the parent/debugger has not gone yet. > > ==================================================================== > > Now lets fix another problem. A mt exec suceeds and apllication does > > sys_seccomp(SECCOMP_FILTER_FLAG_TSYNC) which fails because it finds > > another (zombie) SECCOMP_MODE_FILTER thread. > > > > And after we fix this problem, what else we will need to fix? > > > > > > I really think that - whatever we do - there should be no other threads > > after exec, even zombies. > > I see where you are coming from. > > I need to stare at this a bit longer. Because you are right. 
Reusing > the signal_struct and leaving zombies around is very prone to bugs. So > it is not very maintainable. Yes, yes, yes. This is exactly what I was arguing. > I suspect the answer here is to simply allocate a new sighand_struct and > a new signal_struct if we are not single threaded by the time we > get down to the end of de_thread. Maybe. Not sure. Looks very nontrivial. And I still think that if we do this, we should fix the bug first, then try to do something like this. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-04-02 16:15 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-02 16:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 03/30, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > Very nice. So de_thread() returns as soon as all other threads decrement > > signal->live in do_exit(). Before they do, say, exit_mm(). This is already > > wrong, for example this breaks OOM. Plus a lot more problems afaics, but > > lets ignore this. > > Which means that we need to keep sig->notify_count. Yes, although we need to make it less ugly. > > Note that de_thread() also unshares ->sighand before return. So in the > > case of mt exec it will likely see oldsighand->count != 1 and alloc the > > new sighand_struct and this breaks the locking. > > > > Because the execing thread will use newsighand->siglock to protect its > > signal_struct while the zombie threads will use oldsighand->siglock to > > protect the same signal struct. Yes, tasklist_lock + the fact irq_disable > > implies rcu_lock mostly save us but not entirely, say, a foreign process > > doing __send_signal() can take the right or the wrong lock depending on > > /dev/random. > > Which leads to the question how can we get back tot he 2.4 behavior > of freeing sighand_struct in do_exit? > > At which point as soon as we free sighand_struct if we are the last > to dying thread notify de_thread and everything works. I was thinking about the similar option, see below, but decided that we should not do this at least right now. > For what __ptrace_unlink is doing we should just be able to skip > acquiring of siglock if PF_EXITING is set. We can even remove it from release_task() path, this is simple. 
> __exit_signal is a little more interesting but half of what it is > doing looks like it was pulled out of do_exit and just needs to > be put back. That is. I think we should actually unhash the exiting sub-thread even if it is traced. IOW, remove it from thread/pid/parent/etc lists and nullify its ->sighand. IMO, whatever we do, thread_group_empty(current) should be true after exec. So the exiting sub-thread should look almost like an EXIT_DEAD task, except it should still report to the debugger. But this is dangerous. Say, wait4(upid <= 0) becomes unsafe because task_pid_type(PIDTYPE_PGID) won't work. > Which probably adds up to 4 or 5 small carefully written patches to sort > out that part of the exit path, Perhaps I am wrong, but I think you underestimate the problems, and it is not clear to me if we really want this. ========================================================================= Anyway, Eric, even if we can and want to do this, why can't we do this on top of my fix? I simply fail to understand why you dislike it that much. Yes, it is not pretty, I said this many times, but it is safe in that it doesn't really change the current behaviour. I am much more worried about 2/2, which you didn't argue with; this patch _can_ break something and this is obviously not good even if PTRACE_EVENT_EXIT was always broken. Oleg. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-04-02 21:07 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-02 21:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > Perhaps I am wrong, but I think you underestimate the problems, and it is > not clear to me if we really want this. I worked through quite a bit of it and I realized a few fundamental issues. The task struct must remain visible until it is reaped, and we use siglock in __unhash_process() to protect that reaping. Further, tsk->sighand == NULL winds up being a flag used to tell if release_task has been called. To get a usable count on sighand_struct all that needed to be done was to change the reference counting of sighand_struct to count processes and not threads. Which is what I wound up posting. > Anyway, Eric, even if we can and want to do this, why we can't do this on > top of my fix? Because your reduction in scope of cred_guard_mutex is fundamentally broken and unnecessary. > I simply fail to understand why you dislike it that much. Yes it is not > pretty, I said this many times, but it is safe in that it doesn't really > change the current behaviour. No it is not safe. And it promotes wrong thinking which is even more dangerous. I reviewed the code and cred_guard_mutex needs to cover what it covers. > I am much more worried about 2/2, which you didn't argue with; this patch _can_ > break something and this is obviously not good even if PTRACE_EVENT_EXIT > was always broken. I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually know what the implications of changing it are. Let's see... gdb - no upstart - no lldb - yes strace - no It looks like lldb is worth testing with your PTRACE_EVENT_EXIT change to see if anything breaks. I think we can get away with changing the exec case but it does look worth testing. I hadn't realized you hadn't looked to see what was using PTRACE_O_TRACEEXIT to see if any part of userspace cares. Hmm. This is interesting. From the strace documentation: > Tracer cannot assume that ptrace-stopped tracee exists. There are many > scenarios when tracee may die while stopped (such as SIGKILL). > Therefore, tracer must always be prepared to handle ESRCH error on any > ptrace operation. Unfortunately, the same error is returned if tracee > exists but is not ptrace-stopped (for commands which require stopped > tracee), or if it is not traced by process which issued ptrace call. > Tracer needs to keep track of stopped/running state, and interpret > ESRCH as "tracee died unexpectedly" only if it knows that tracee has > been observed to enter ptrace-stop. Note that there is no guarantee > that waitpid(WNOHANG) will reliably report tracee's death status if > ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. > IOW: tracee may be "not yet fully dead" but already refusing ptrace > ops. If delivering a second SIGKILL to a ptrace-stopped process will make it continue we have a very interesting out. When we stop in ptrace_stop we stop in TASK_TRACED == (TASK_WAKEKILL|__TASK_TRACED). Delivery of a SIGKILL to that task will queue SIGKILL and call signal_wake_up_state(t, TASK_WAKEKILL). Which becomes wake_up_state(t, TASK_INTERRUPTIBLE | TASK_WAKEKILL). Which wakes up the process. So userspace can absolutely kill a process in PTRACE_EVENT_EXIT before the tracers find it. Therefore we are only talking about a quality-of-implementation issue if we actually stop and wait for the tracer or not. .... Which brings us to your PTRACE_EVENT_EXIT patch. I think may_ptrace_stop is tested in the wrong place, and is probably buggy. - We should send the signal in all cases except when the ptracing parent does not exist aka (!current->ptrace). The siginfo contains enough information to understand what happened if anyone is listening. - Then we should send the group stop. - Then if we don't want to wait we should: __set_current_state(TASK_RUNNING) - Then we should drop the locks and only call freezable_schedule if we want to wait. That way userspace thinks someone else just sent a SIGKILL and killed the thread before it had a chance to look (which is effectively what we are doing). That sounds ideal for both core-dumps and this case. Eric
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-04-03 18:37 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-03 18:37 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Eric, I see another series from you, but I simply failed to force myself to read it carefully. Because at first glance it makes me really sad, I do dislike it even if it is correct. Yes, yes, sure, I can be wrong. Will try tomorrow. On 04/02, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > Anyway, Eric, even if we can and want to do this, why we can't do this on > > top of my fix? > > Because your reduction in scope of cred_guard_mutex is fundamentally > broken and unnecessary. And you never explained why it is wrong, or I failed to understand you. > > I simply fail to understand why you dislike it that much. Yes it is not > > pretty, I said this many times, but it is safe in that it doesn't really > > change the current behaviour. > > No it is not safe. And it promotes wrong thinking which is even more > dangerous. So please explain why it is not safe and why it is dangerous. Just in case, if you mean flush_signal_handlers() outside of cred_guard_mutex, please explain what I have missed in case you still think this is wrong. > I reviewed the code and cred_guard_mutex needs to cover what it covers. I strongly, strongly disagree. Its scope is unnecessarily huge, we should narrow it in any case, even if the current code was not buggy. But this is almost offtopic, let's discuss this separately. > > I am much more worried about 2/2 you didn't argue with, this patch _can_ > > break something and this is obviously not good even if PTRACE_EVENT_EXIT > > was always broken. > > I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually > know what the implications of changing it are. Let's see... And nobody knows ;) This is the problem: even a clear ptrace bugfix can break something, this happened before and we had to revert obviously-correct patches; the bug was already used as a feature. > If delivering a second SIGKILL ... > So userspace can absolutely kill a process in PTRACE_EVENT_EXIT > before the tracers find it. > > Therefore we are only talking about a quality-of-implementation issue > if we actually stop and wait for the tracer or not. Oh, this is another story, needs another discussion. We really need some changes in this area, we need to distinguish SIGKILL sent from user-space and (say) from group-exit, and we need to decide when we should stop. But at least I think the tracee should never stop if SIGKILL comes from user space. And yes, ptrace_stop() is ugly and wrong, just look at the arch_ptrace_stop_needed() check. The problem, again, is that any fix will be user-visible. Oleg.
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-04-03 22:49 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-03 22:49 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > Eric, > > I see another series from you, but I simply failed to force myself to read > it carefully. Because at first glance it makes me really sad, I do dislike > it even if it is correct. Yes, yes, sure, I can be wrong. Will try > tomorrow. Yes. I needed to get my thoughts concrete. I missed fixing the race in zap_other_threads. But overall I think things are moving in a good direction. >> >> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually >> know what the implications of changing it are. Let's see... > > And nobody knows ;) This is the problem: even a clear ptrace bugfix can > break something, this happened before and we had to revert obviously- > correct patches; the bug was already used as a feature. Yes that is the challenge of changing userspace. Which is why it helps to test as much of a userspace change as possible. Or to get very clever, and figure out how to avoid the userspace change. So I think it is worth knowing that lldb actually uses PTRACE_O_TRACEEXIT. So we can test at least some programs to verify that all is well. I don't see any way around cleaning up PTRACE_O_TRACEEXIT. As we fundamentally have the non-thread-group-leader exec problem. We have to reap that previous leader thread with release_task. Which means we can't stop for a PTRACE_O_TRACEEXIT. >> If delivering a second SIGKILL > ... >> So userspace can absolutely kill a process in PTRACE_EVENT_EXIT >> before the tracers find it. >> >> Therefore we are only talking about a quality-of-implementation issue >> if we actually stop and wait for the tracer or not. > > Oh, this is another story, needs another discussion. We really need some > changes in this area, we need to distinguish SIGKILL sent from user-space > and (say) from group-exit, and we need to decide when we should stop. > > But at least I think the tracee should never stop if SIGKILL comes from > user space. And yes, ptrace_stop() is ugly and wrong, just look at the > arch_ptrace_stop_needed() check. The problem, again, is that any fix will > be user-visible. The only issue I see is that arch_ptrace_stop() may sleep (sparc and ia64 do as they flush the register stack to memory). As the code may sleep it means we can't set TASK_TRACED until after calling arch_ptrace_stop(). My inclination is to just solve that by saying: if (!sigkill_pending(current)) set_current_state(TASK_TRACED); That removes the special case. We have to handle SIGKILL being delivered immediately after set_current_state in any event. And as we are talking about something that happens on rare architectures I don't see any problem with tweaking that code at all. It is closely enough related that I will fold it into the next version of my patch. Eric
* scope of cred_guard_mutex. @ 2017-04-03 22:49 ` Eric W. Biederman 0 siblings, 0 replies; 93+ messages in thread From: Eric W. Biederman @ 2017-04-03 22:49 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api Oleg Nesterov <oleg@redhat.com> writes: > On 04/02, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > Anyway, Eric, even if we can and want to do this, why we can't do this on >> > top of my fix? >> >> Because your reduction in scope of cred_guard_mutex is fundamentally >> broken and unnecessary. > > And you never explained why it is wrong, or I failed to understand you. > >> > I simply fail to understand why you dislike it that much. Yes it is not >> > pretty, I said this many times, but it is safe in that it doesn't really >> > change the current behaviour. >> >> No it is not safe. And it promotes wrong thinking which is even more >> dangerous. > > So please explain why it is not safe and why it is dangerous. > > Just in case, if you mean flush_signal_handlers() outside of cred_guard_mutex, > please explain what I have missed in case you still think this is wrong. >> I reviewed the code and cred_guard_mutex needs to cover what it covers. > > I strongly, strongly disagree. Its scope is unnecessarily huge, we should narrow > it in any case, even if the current code was not buggy. But this is almost > offtopic, let's discuss this separately. You have asked why I have problems with your patch and so I am going to try to explain. Partly I want to see a clean set of patches that we can merge into Linus's tree before we make any compromises. Because the work preparing a clean patchset may inform us of something better. Plus we need to make something clean and long-term maintainable in any event. Partly I object because your understanding and my understanding of cred_guard_mutex are very different. As I read and understand the code, the job of cred_guard_mutex is to keep ptrace (and other threads of the process) from interfering in exec, and to ensure old resources are accessed with permission checks using our original credentials and that new and modified resources are accessed with permission checks using our new credentials. I object to your patch in particular because you deliberately mess up the part of only making old resources available with old creds and new resources available with new creds. Even if the current permission checks are a don't-care it still remains conceptually wrong. And conceptually wrong code tends towards maintenance problems and real surprises when someone makes small changes to the code. Which is what I mean when I say your patch is dangerous. AKA what I see needing to be protected looks something like: mutex_lock(); new_cred = compute_new_cred(tsk); new_mm = compute_new_mm(tsk); tsk->mm = new_mm; tsk->cred = new_cred; zap_other_threads(tsk); update_sighand(tsk); update_signal(tsk); do_close_on_exec(); update_tsk_fields(tsk); mutex_unlock(); The only way I can see of reducing the scope of cred_guard_mutex is performing work in such a way that ptrace and the other threads can't interfere and then taking the lock. Computing the new mm and the new credentials are certainly candidates for that kind of treatment. Eric
* Re: scope of cred_guard_mutex. 2017-04-03 22:49 ` Eric W. Biederman (?) @ 2017-04-05 16:08 ` Oleg Nesterov 2017-04-05 16:11 ` Kees Cook 2017-04-05 17:53 ` Eric W. Biederman -1 siblings, 2 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 16:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api On 04/03, Eric W. Biederman wrote: > > You have asked why I have problems with your patch and so I am going to > try to explain. Partly I want to see a clean set of patches that we > can merge into Linus's tree before we make any compromises. Because the > work preparing a clean patchset may inform us of something better. Plus > we need to make something clean and long term maintainable in any event. > > Partly I object because your understanding and my understanding of > cred_guard_mutex are very different. And I think there is another problem: your understanding and my understanding of "clean" differ too much and it seems that we cannot convince each other ;) The last series looks buggy (I'll send more emails later today), but the main problem is that - in my opinion! - your approach is "obviously wrong and much less clean". But yes, yes, I understand that this is my opinion, and I can be wrong. Eric, I think we need more CC's. Linus, probably the security list, the more the better. I am going to resend my series with more CC's, then you can nack it and explain what you think we should do. Perhaps someone else will suggest a better solution, or at least review the patches. OK? Oleg.
* Re: scope of cred_guard_mutex. 2017-04-05 16:08 ` Oleg Nesterov @ 2017-04-05 16:11 ` Kees Cook 2017-04-05 17:53 ` Eric W. Biederman 1 sibling, 0 replies; 93+ messages in thread From: Kees Cook @ 2017-04-05 16:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Michal Hocko, Ulrich Obergfell, LKML, Linux API On Wed, Apr 5, 2017 at 9:08 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 04/03, Eric W. Biederman wrote: >> >> You have asked why I have problems with your patch and so I am going to >> try to explain. Partly I want to see a clean set of patches that we >> can merge into Linus's tree before we make any compromises. Because the >> work preparing a clean patchset may inform us of something better. Plus >> we need to make something clean and long term maintainable in any event. >> >> Partly I object because your understanding and my understanding of >> cred_guard_mutex are very different. > > And I think there is another problem, your understanding and my understanding > of "clean" differ too much and it seems that we can not convince each other ;) > > The last series looks buggy (I'll send more emails later today), but the > main problem is that - in my opinion! - your approach is "obviously wrong > and much less clean". But yes, yes, I understand that this is my opinion, > and I can be wrong. > > Eric, I think we need more CC's. Linus, probably security list, the more > the better. > > I am going to resend my series with more CC's, then you can nack it and > explain what you think we should do. Perhaps someone else will suggest > a better solution, or at least review the patches. OK? I've been following along, but it seems like there are a lot of edge cases in these changes. I'll try to meaningfully comment on the coming emails... having code examples of why various things will/won't work go a long way for helping understand what's safe or not... 
-Kees

--
Kees Cook
Pixel Security

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: scope of cred_guard_mutex. 2017-04-05 16:08 ` Oleg Nesterov 2017-04-05 16:11 ` Kees Cook @ 2017-04-05 17:53 ` Eric W. Biederman 2017-04-05 18:15 ` Oleg Nesterov 1 sibling, 1 reply; 93+ messages in thread From: Eric W. Biederman @ 2017-04-05 17:53 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/03, Eric W. Biederman wrote:
>>
>> You have asked why I have problems with your patch and so I am going to
>> try to explain. Partly I want to see a clean set of patches that we
>> can merge into Linus's tree before we make any compromises. Because the
>> work preparing a clean patchset may inform us of something better. Plus
>> we need to make something clean and long term maintainable in any event.
>>
>> Partly I object because your understanding and my understanding of
>> cred_guard_mutex are very different.
>
> And I think there is another problem, your understanding and my understanding
> of "clean" differ too much and it seems that we can not convince each other ;)

We have barely begun. You have not shown anyone what your idea of a clean fix actually is. All I have seen from you is a quick hack that is back-portable. Focusing on the backport is the wrong order in which to solve the issue. We need to solve this in an upstream mergeable and maintainable way and then we can worry about backports.

From a userspace perspective, I find anything in the kernel blocking on a zombie to be just wrong. A zombie is dead. Waiting for a zombie should in all cases be optional. The system should not break if we don't reap a zombie.

You have made a clear case that the zombies need to exist for strace -f to wait on. So since the zombies must exist we should make them follow normal zombie rules. With your change exec still blocks waiting for zombies.
Furthermore you have to violate the reasonable rule that:
* pre-exec resources are guarded with the pre-exec process cred.
* post-exec resources are guarded with the post-exec process cred.

So from a userspace perspective I think your semantics are absolutely insane. We also need clean code to implement all of this (which I am still inching towards). But we need to be implementing something that makes sense from a userspace perspective.

> The last series looks buggy (I'll send more emails later today), but the
> main problem is that - in my opinion! - your approach is "obviously wrong
> and much less clean". But yes, yes, I understand that this is my opinion,
> and I can be wrong.

How is fixing the implementation so that we don't block waiting for zombies to be reaped wrong?

> Eric, I think we need more CC's. Linus, probably security list, the more
> the better.
>
> I am going to resend my series with more CC's, then you can nack it and
> explain what you think we should do. Perhaps someone else will suggest
> a better solution, or at least review the patches. OK?

I will be happy to look but my primary objections to your patch were:
- You implemented a hack for backporting rather than fixing things cleanly the first time.
- You made comments about cred_guard_mutex and its scope that when I reviewed the code were false.

cred_guard_mutex, although probably the wrong locking structure, is semantically and fundamentally where it needs to be. We can optimize it but we can't change what is protected to make our lives easier.

Eric

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: scope of cred_guard_mutex. 2017-04-05 17:53 ` Eric W. Biederman @ 2017-04-05 18:15 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-05 18:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

On 04/05, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> - You made comments about cred_guard_mutex and its scope that when I
> reviewed the code were false.

Too late for me. I'll try to read other emails from you and reply tomorrow.

Oleg.

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: scope of cred_guard_mutex. @ 2017-04-06 15:55 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-04-06 15:55 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api, Eugene Syromiatnikov

On 04/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> I reviewed the code and cred_guard_mutex needs to cover what it covers.
> >
> > I strongly, strongly disagree. Its scope is unnecessarily huge, we should narrow
> > it in any case, even if the current code was not buggy. But this is almost
> > offtopic, let's discuss this separately.

And let me repeat/clarify. I meant that I see absolutely no reason to do copy_strings() with this mutex held, for example. And note that copy_strings() can use a lot of memory/time; it can trigger oom, swapping, etc.

I also think that we can probably do check_unsafe_exec(), which in particular sets LSM_UNSAFE_, a bit later, but I am less sure about this and it needs more work. And perhaps more changes like this to narrow the scope of this mutex. And I thought you had already agreed with this?

> You have asked why I have problems with your patch and so I am going to
> try to explain. Partly I want to see a clean set of patches that we
> can merge into Linus's tree before we make any compromises.

Sure, me too. I do not see a simple and clean fix; your attempts were wrong so far and imo were worse even if they were correct. And this makes me think again that we need to restart this discussion with more CC's.

> Partly I object because your understanding and my understanding of
> cred_guard_mutex are very different.
>
> As I read and understand the code the job of cred_guard_mutex is to keep
> ptrace (and other threads of the process) from interfering in
> exec and to ensure old resources are accessed with permission checks
> using our original credentials and that new and modified resources are
> accessed with permission checks using our new credentials.

Yes, this is clear.

> I object to your patch in particular because you deliberately mess up
> the part of only making old resources available with old creds and
> new resources available with new creds.

Could you spell this out, please? I don't understand.

> AKA What I see needing to be protected looks something like:
> mutex_lock();
> new_cred = compute_new_cred(tsk);
> new_mm = compute_new_mm(tsk);
> tsk->mm = new_mm;
> tsk->cred = new_cred;
> zap_other_threads(tsk);
> update_sighand(tsk);
> update_signal(tsk);
> do_close_on_exec();
> update_tsk_fields(tsk);
> mutex_unlock();
>
> The only way I can see of reducing the scope of cred_guard_mutex is
> performing work in such a way that ptrace and the other threads can't
> interfere and then taking the lock. Computing the new mm and the new
> credentials are certainly candidates for that kind of treatment.

OK. And yes, I am not sure this all is optimal, but didn't I say from the very beginning that it is unlikely we can change this?

Now let me quote your next email:

On 04/05, Eric W. Biederman wrote:
>
> We have barely begun. You have not shown anyone what your idea of a
> clean fix actually is. All I have seen from you is a quick hack that is
> a hack that is back-portable.

Yes. I really tried to make a back-portable fix. Otherwise I would start with ->notify_count cleanups.

> With your change exec still blocks waiting for zombies.

And this is what we currently do. Whether we should do this may be debatable, but why do you blame my patch?

> Furthermore you have to violate the reasonable rule that:
> * pre-exec resources are guarded with the pre-exec process cred.
> * post-exec resources are guarded with the post-exec process cred.

For example?

> So from a userspace perspective I think your semantics are absolutely
> insane.

Which semantics were changed by my patch?

> - You made comments about cred_guard_mutex and its scope that when I
> reviewed the code were false.

Which of my comments was wrong?

> We can optimize it

And this is what I meant when I said "we should narrow it in any case".

> but we can't change what is protected

I am not that sure. But a) I did not even try to suggest changing anything in this area right now, and b) I said that this is unlikely to be possible.

Oleg.

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: scope of cred_guard_mutex. @ 2017-04-07 22:07 ` Kees Cook 0 siblings, 0 replies; 93+ messages in thread From: Kees Cook @ 2017-04-07 22:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Michal Hocko, Ulrich Obergfell, LKML, Linux API, Eugene Syromiatnikov

On Thu, Apr 6, 2017 at 8:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> And this makes me think again that we need to restart this discussion with
> more CC's.

I'm a fan of that; I've not been able to follow this thread as it seems to have gone far from the original deadlock problem. :) I've seen issues with ptrace, zombies, and now exec. I'm lost. :P

>> Partly I object because your understanding and my understanding of
>> cred_guard_mutex are very different.
>>
>> As I read and understand the code the job of cred_guard_mutex is to keep
>> ptrace (and other threads of the process) from interfering in
>> exec and to ensure old resources are accessed with permission checks
>> using our original credentials and that new and modified resources are
>> accessed with permission checks using our new credentials.
>
> Yes, this is clear.

Maybe stupid idea: can we get a patch that just adds this kind of documentation somewhere in the source? If we can agree on the purpose of cred_guard_mutex, and get it into the code, that seems like a good step in discussion...

-Kees

--
Kees Cook
Pixel Security

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped. @ 2017-09-04 3:19 ` Robert O'Callahan 0 siblings, 0 replies; 93+ messages in thread From: Robert O'Callahan @ 2017-09-04 3:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Oleg Nesterov, Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, open list, linux-api

Sorry about replying to this old thread, but...

On Mon, Apr 3, 2017 at 9:07 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
> know what the implications of changing it are. Let's see...
>
> gdb - no
> upstart - no
> lldb - yes
> strace - no

For the record, rr uses PTRACE_O_TRACEEXIT. When a thread exits we need to examine its address space to find its robust futex list and record the changes that will be performed by the kernel as it cleans up that list. So, any failures to deliver PTRACE_EVENT_EXIT are potentially problematic for us because we won't get a chance to examine the address space before it disappears.

Rob

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/2] fix the traced mt-exec deadlock @ 2017-03-04 16:54 ` Oleg Nesterov 0 siblings, 0 replies; 93+ messages in thread From: Oleg Nesterov @ 2017-03-04 16:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Aleksa Sarai, Andy Lutomirski, Attila Fazekas, Jann Horn, Kees Cook, Michal Hocko, Ulrich Obergfell, linux-kernel, linux-api

On 03/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> @@ -699,8 +701,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >> }
> >>
> >> tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> >> - if (tsk->exit_state == EXIT_DEAD)
> >> - list_add(&tsk->ptrace_entry, &dead);
> >>
> >> /* mt-exec, de_thread() is waiting for group leader */
> >> if (unlikely(tsk->signal->notify_count < 0))
> >> @@ -711,6 +711,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >> list_del_init(&p->ptrace_entry);
> >> release_task(p);
> >> }
> >> + if (autoreap)
> >> + release_task(tsk);
> >
> > These 2 changes are not needed. release_task(tsk) will be called by
> > list_for_each_entry_safe() above if autoreap == T.
>
> Except for the practical case that for threads that are ptraced
> tsk->ptrace_entry is already in use. Which means we can't use
> list_add(&tsk->ptrace_entry, &dead).

Yes, I was wrong here, thanks for correcting me.

Oleg.

^ permalink raw reply [flat|nested] 93+ messages in thread