From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from out03.mta.xmission.com ([166.70.13.233]:37372 "EHLO
        out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1761889AbcKDN3g (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Fri, 4 Nov 2016 09:29:36 -0400
From: ebiederm@xmission.com (Eric W. Biederman)
To: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jann@thejh.net>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Roland McGrath <roland@hack.frob.com>,
        John Johansen <john.johansen@canonical.com>,
        James Morris <james.l.morris@oracle.com>,
        "Serge E. Hallyn" <serge@hallyn.com>,
        Paul Moore <aul@paul-moore.com>,
        Stephen Smalley <sds@tycho.nsa.gov>,
        Eric Paris <eparis@parisplace.org>,
        Casey Schaufler <casey@schaufler-ca.com>,
        Kees Cook <keescook@chromium.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Janis Danisevskis <jdanis@google.com>,
        Seth Forshee <seth.forshee@canonical.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Benjamin LaHaise <bcrl@kvack.org>,
        Ben Hutchings <ben@decadent.org.uk>,
        Andy Lutomirski <luto@amacapital.net>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Krister Johansen <kjlx@templeofstupid.com>,
        linux-fsdevel@vger.kernel.org,
        linux-security-module@vger.kernel.org, security@kernel.org
References: <1477863998-3298-1-git-send-email-jann@thejh.net>
        <1477863998-3298-2-git-send-email-jann@thejh.net>
        <20161102181806.GB1112@redhat.com>
        <20161102205011.GF8196@pc.thejh.net>
        <20161103181225.GA11212@redhat.com>
Date: Fri, 04 Nov 2016 08:26:28 -0500
In-Reply-To: <20161103181225.GA11212@redhat.com> (Oleg Nesterov's message of
        "Thu, 3 Nov 2016 19:12:25 +0100")
Message-ID: <87k2cj2x6j.fsf@xmission.com>
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [PATCH v3 1/8] exec: introduce cred_guard_light
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/02, Jann Horn wrote:
>>
>> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
>> > On 10/30, Jann Horn wrote:
>> > >
>> > > This is a new per-threadgroup lock that can often be taken instead of
>> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
>> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>> > > thread, and the debugger attempts to inspect procfs files of the debugged
>> > > task.
>> >
>> > Yes, but let me repeat that we need to fix this anyway. So I don't really
>> > understand why should we add yet another mutex.
>>
>> execve() only takes the new mutex immediately after de_thread(), so this
>> problem shouldn't occur there.
>
> Yes, I see.
>
>> Basically, I think that I'm not making the
>> problem worse with my patches this way.
>
> In a sense that it doesn't add the new deadlocks, I agree. But it adds
> yet another per-process mutex while we already have the similar one,
>
>> I believe that it should be possible to convert most existing users of the
>> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
>> see are:
>>
>>  - PTRACE_ATTACH
>
> This is the main problem afaics. So "strace -f" can hang if it races
> with mt-exec. And we need to fix this. I constantly forget about this
> problem, but I tried many times to find a reasonable solution, still
> can't.
>
> IMO, it would be nice to rework the lsm hooks, so that we could take
> cred_guard_mutex after de_thread() (like your cred_guard_light) or
> at least drop it earlier, but unlikely this is possible...
>
> So the only plan I currently have is change de_thread() to wait until
> other threads pass exit_notify() or even exit_signals(), but I don't
> like this.
>
>>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
>
> I forgot about this one... Need to re-check but at first glance this
> is not a real problem.
>
>> Beyond that, conceptually, the new cred_guard_light could also be turned
>> into a read-write mutex
>
> Not sure I understand how this can help... doesn't matter.
>
> My point is, imo you should not add the new mutex. Just use the old
> one in (say) 4/8 (which I do not personally like as you know ;), this
> won't add the new problem.
>
>
>> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
>> deadlocking issues.
>
> Yes, agreed.
>
>> PTRACE_ATTACH isn't that clear to me; if a debugger
>> tries to attach to a newly spawned thread while another ptraced thread is
>> dying because of de_thread() in a third thread, that might still cause
>> the debugger to deadlock, right?
>
> This is the trivial test-case I wrote when the problem was initially
> reported. And damn, I always knew that cred_guard_mutex needs fixes,
> but somehow I completely forgot that it is used by PTRACE_ATTACH when
> I was going to try to remove from fs/proc a long ago.
>
> 	void *thread(void *arg)
> 	{
> 		ptrace(PTRACE_TRACEME, 0,0,0);
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		int pid = fork();
>
> 		if (!pid) {
> 			pthread_t pt;
> 			pthread_create(&pt, NULL, thread, NULL);
> 			pthread_join(pt, NULL);
> 			execlp("echo", "echo", "passed", NULL);
> 		}
>
> 		sleep(1);
> 		// or anything else which needs ->cred_guard_mutex,
> 		// say open(/proc/$pid/mem)
> 		ptrace(PTRACE_ATTACH, pid, 0,0);
> 		kill(pid, SIGCONT);
>
> 		return 0;
> 	}
>
> The problem is trivial. The execing thread waits until its sub-thread
> goes away, it should be reaped by the tracer, the tracer waits for
> cred_guard_mutex.

There is a bug here but I don't believe it has anything to do with
the cred_guard_mutex.

If we reach zap_other_threads fundamentally the tracer should not
be able to block the traced thread from exiting.  Those are the
semantics described in the comments in the code.

I have poked things a little and have a half fix for that but
the fix appears to be the wrong, but enlightening.

AKA the following prevents the hang of your test case.
diff --git a/kernel/signal.c b/kernel/signal.c
index 75761acc77cf..a6f83450500e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1200,7 +1200,7 @@ int zap_other_threads(struct task_struct *p)
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		signal_wake_up_state(t, TASK_WAKEKILL | __TASK_TRACED);
 	}
 
 	return count;

It looks like somewhere on the exit path the traced thread is blocking
without setting TASK_WAKEKILL.

Eric