From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 5 Nov 2016 15:56:23 +0100
From: Oleg Nesterov
To: "Eric W. Biederman"
Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen, James Morris,
	"Serge E. Hallyn", Paul Moore, Stephen Smalley, Eric Paris,
	Casey Schaufler, Kees Cook, Andrew Morton, Janis Danisevskis,
	Seth Forshee, Thomas Gleixner, Benjamin LaHaise, Ben Hutchings,
	Andy Lutomirski, Linus Torvalds, Krister Johansen,
	linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org,
	security@kernel.org
Subject: Re: [PATCH v3 1/8] exec: introduce cred_guard_light
Message-ID: <20161105145623.GA21207@redhat.com>
References: <1477863998-3298-1-git-send-email-jann@thejh.net>
	<1477863998-3298-2-git-send-email-jann@thejh.net>
	<20161102181806.GB1112@redhat.com>
	<20161102205011.GF8196@pc.thejh.net>
	<20161103181225.GA11212@redhat.com>
	<87k2cj2x6j.fsf@xmission.com>
	<87k2cjuw6h.fsf@xmission.com>
	<20161104180416.GA19221@redhat.com>
	<20161104184505.GA21320@redhat.com>
In-Reply-To: <20161104184505.GA21320@redhat.com>

On 11/04, Oleg Nesterov wrote:
>
> On 11/04, Oleg Nesterov wrote:
> >
> > On 11/04, Eric W. Biederman wrote:
> > >
> > > The following mostly correct patch modifies zap_other_threads in
> > > the case of a de_thread to not wait for zombies to be reaped. The only
> > > case that cares is ptrace (as threads are self reaping). So I don't
> > > think this will cause any problems except removing the strace -f race.
> >
> > From my previous email:
> >
> > 	So the only plan I currently have is change de_thread() to wait until
> > 	other threads pass exit_notify() or even exit_signals(), but I don't
> > 	like this.
> >
> > And yes, I don't like this, but perhaps this is what we should do.
> >
> > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
> > checks doesn't look right, but off course technically this change should
> > be simple enough.
> >
> > But not that simple. Just for example, the exiting sub-threads should
> > not run with ->group_leader pointing to nowhere, in case it was reaped
> > by de_thread.
>
> Not to mention other potential problems outside of ptrace/exec. For example
> userns_install() can fail after mt-exec even without ptrace, simply because
> thread_group_empty() can be false. Sure, easy to fix, and probably _install()
> should use signal->live anyway, but still.
>
> And I didn't mention the fun with sighand unsharing. We simply can't do this
> until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
> The execing thread and the zombie threads will use different locks to, say,
> remove the task from thread-group. Again, this is fixable, but not that
> simple.
>
> > And we have another problem with PTRACE_EVENT_EXIT which can lead to the
> > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
> > defined. But this change will add the user-visible change.
> >
> > And if we add the user-visible changes, then perhaps we could simply untrace
> > the traced sub-threads on exec. This change is simple, we do not even need
> > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
> > if exec is in progress. But I'm afraid we can't do this.

So I was thinking about something like below. Untested, probably
buggy/incomplete too, but hopefully can work.

flush_old_exec() calls the new kill_sub_threads() helper which waits until
all the sub-threads pass exit_notify().
de_thread() is called after install_exec_creds(); it is simplified and waits
for thread_group_empty() without cred_guard_mutex.

But again, I do not really like this, and we need to do something with
PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible.
And I disagree that this has nothing to do with cred_guard_mutex.

And in any case we should narrow its scope in the do_execve() path. Why do
we take it so early? Why do we need to do, say, copy_strings() with this
lock held? The original motivation for this has gone, acct_arg_size() can
work just fine even if multiple threads call sys_execve().

I'll try to discuss the possible changes in LSM hooks with Jann, I still
think that this is what we actually need to do. At least try to do, possibly
this is too complicated.

Oleg.

 fs/binfmt_elf.c         |    6 ++-
 fs/exec.c               |  108 ++++++++++++++++++++++++------------------------
 include/linux/binfmts.h |    1 +
 kernel/exit.c           |    5 ++-
 kernel/signal.c         |    3 +-
 5 files changed, 65 insertions(+), 58 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	setup_new_exec(bprm);
 	install_exec_creds(bprm);
 
+	retval = de_thread(current);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
				 executable_stack);
 	if (retval < 0)
 		goto out_free_dentry;
-
+
 	current->mm->start_stack = bprm->p;
 
 	/* Now we do a little grungy work by mmapping the ELF image into

diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9..7246b9f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1036,13 +1036,59 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
+{
+	for (;;) {
+		if (unlikely(__fatal_signal_pending(tsk)))
+			goto killed;
+		set_current_state(TASK_KILLABLE);
+		if (!sig->notify_count)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EINTR;
+}
+
+static int kill_sub_threads(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	int err = -EINTR;
+
+	if (thread_group_empty(tsk))
+		return 0;
+
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (!signal_group_exit(sig)) {
+		sig->group_exit_task = tsk;
+		sig->notify_count = -zap_other_threads(tsk);
+		err = 0;
+	}
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+
+	if (!err)
+		err = wait_for_notify_count(tsk, sig);
+	return err;
+
+}
+
 /*
  * This function makes sure the current process has its own signal table,
  * so that flush_signal_handlers can later reset the handlers without
  * disturbing other processes. (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1051,34 +1097,15 @@ static int de_thread(struct task_struct *tsk)
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
 
-	/*
-	 * Kill all other threads in the thread group.
-	 */
 	spin_lock_irq(lock);
-	if (signal_group_exit(sig)) {
-		/*
-		 * Another group action in progress, just
-		 * return so that the signal is processed.
-		 */
-		spin_unlock_irq(lock);
-		return -EAGAIN;
-	}
-
-	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
+	sig->notify_count = sig->nr_threads;
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
-
-	while (sig->notify_count) {
-		__set_current_state(TASK_KILLABLE);
-		spin_unlock_irq(lock);
-		schedule();
-		if (unlikely(__fatal_signal_pending(tsk)))
-			goto killed;
-		spin_lock_irq(lock);
-	}
 	spin_unlock_irq(lock);
 
+	if (wait_for_notify_count(tsk, sig))
+		return -EINTR;
+
 	/*
 	 * At this point all other threads have exited, all we have to
 	 * do is to wait for the thread group leader to become inactive,
@@ -1087,24 +1114,8 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk)) {
 		struct task_struct *leader = tsk->group_leader;
 
-		for (;;) {
-			threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exit_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
-			schedule();
-			if (unlikely(__fatal_signal_pending(tsk)))
-				goto killed;
-		}
-
+		threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 		/*
 		 * The only record we have of the real-time age of a
 		 * process, regardless of execs it's done, is start_time.
@@ -1162,10 +1173,9 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+no_thread_group:
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
-
-no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
@@ -1197,14 +1207,6 @@ static int de_thread(struct task_struct *tsk)
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *get_task_comm(char *buf, struct task_struct *tsk)
@@ -1239,7 +1241,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = kill_sub_threads(current);
 	if (retval)
 		goto out;
 
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
+extern int de_thread(struct task_struct *tsk);
 extern void would_dump(struct linux_binprm *, struct file *);
 extern int suid_dumpable;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45..f3dd46d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -690,8 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
+	/* mt-exec, kill_sub_threads() is waiting for group exit */
+	if (unlikely(tsk->signal->notify_count < 0) &&
+	    !++tsk->signal->notify_count)
 		wake_up_process(tsk->signal->group_exit_task);
 
 	write_unlock_irq(&tasklist_lock);

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,13 +1194,12 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
-
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;