From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=qryl=4U=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 79673C3F2C6
	for <linux-mm@archiver.kernel.org>; Tue,  3 Mar 2020 08:44:23 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 2D6EB20873
	for <linux-mm@archiver.kernel.org>; Tue,  3 Mar 2020 08:44:23 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2D6EB20873
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ubuntu.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id CAA5D6B0007; Tue,  3 Mar 2020 03:44:22 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C5BC06B0008; Tue,  3 Mar 2020 03:44:22 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id B4B826B000A; Tue,  3 Mar 2020 03:44:22 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0162.hostedemail.com [216.40.44.162])
	by kanga.kvack.org (Postfix) with ESMTP id 9E0716B0007
	for <linux-mm@kvack.org>; Tue,  3 Mar 2020 03:44:22 -0500 (EST)
Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id 61A89ABE3
	for <linux-mm@kvack.org>; Tue,  3 Mar 2020 08:44:22 +0000 (UTC)
X-FDA: 76553414364.08.tub68_8feda09e9c03
X-HE-Tag: tub68_8feda09e9c03
X-Filterd-Recvd-Size: 14261
Received: from youngberry.canonical.com (youngberry.canonical.com [91.189.89.112])
	by imf12.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Tue,  3 Mar 2020 08:44:21 +0000 (UTC)
Received: from ip5f5bf7ec.dynamic.kabel-deutschland.de ([95.91.247.236] helo=wittgenstein)
	by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.86_2)
	(envelope-from <christian.brauner@ubuntu.com>)
	id 1j939X-0004Ox-Q8; Tue, 03 Mar 2020 08:43:47 +0000
Date: Tue, 3 Mar 2020 09:43:46 +0100
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Kees Cook <keescook@chromium.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Oleg Nesterov <oleg@redhat.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Andrei Vagin <avagin@gmail.com>, Ingo Molnar <mingo@kernel.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Yuyang Du <duyuyang@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	David Howells <dhowells@redhat.com>,
	James Morris <jamorris@linux.microsoft.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Shakeel Butt <shakeelb@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Christian Kellner <christian@kellner.me>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: [PATCHv4] exec: Fix a deadlock in ptrace
Message-ID: <20200303084346.jympckv5wjki3orb@wittgenstein>
References: <AM6PR03MB51704206634C009500A8080DE4E70@AM6PR03MB5170.eurprd03.prod.outlook.com>
 <875zfmloir.fsf@x220.int.ebiederm.org>
 <AM6PR03MB51707ABF20B6CBBECC34865FE4E70@AM6PR03MB5170.eurprd03.prod.outlook.com>
 <87v9nmjulm.fsf@x220.int.ebiederm.org>
 <AM6PR03MB5170B976E6387FDDAD59A118E4E70@AM6PR03MB5170.eurprd03.prod.outlook.com>
 <202003021531.C77EF10@keescook>
 <AM6PR03MB5170B5C1B95CB1D065EE3AFAE4E40@AM6PR03MB5170.eurprd03.prod.outlook.com>
 <202003022103.196C313623@keescook>
 <AM6PR03MB5170A15099986CEC3189F251E4E40@AM6PR03MB5170.eurprd03.prod.outlook.com>
 <20200303083425.jbf43axuymttijfv@wittgenstein>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20200303083425.jbf43axuymttijfv@wittgenstein>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> > On 3/3/20 6:29 AM, Kees Cook wrote:
> > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> > >> On 3/3/20 3:26 AM, Kees Cook wrote:
> > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > >>>> [...]
> > >>>
> > >>> If I'm reading this patch correctly, this changes the lifetime of the
> > >>> cred_guard_mutex lock to be:
> > >>> 	- during prepare_bprm_creds()
> > >>> 	- from flush_old_exec() through install_exec_creds()
> > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > >>> install_exec_creds().
> > > 
> > > BTW, I think the effect of this change (i.e. my paragraph above) should
> > > be distinctly called out in the commit log if this solution moves
> > > forward.
> > > 
> > 
> > Okay, will do.
> > 
> > >>> That means, for example, that check_unsafe_exec()'s documented invariant
> > >>> is violated:
> > >>>     /*
> > >>>      * determine how safe it is to execute the proposed program
> > >>>      * - the caller must hold ->cred_guard_mutex to protect against
> > >>>      *   PTRACE_ATTACH or seccomp thread-sync
> > >>>      */
> > >>
> > >> Oh, right, I haven't understood that hint...
> > > 
> > > I know no_new_privs is checked there, but I haven't studied the
> > > PTRACE_ATTACH part of that comment. If that is handled with the new
> > > check, this comment should be updated.
> > > 
> > 
> > Okay, I change that comment to:
> > 
> > /*
> >  * determine how safe it is to execute the proposed program
> >  * - the caller must have set ->cred_locked_in_execve to protect against
> >  *   PTRACE_ATTACH or seccomp thread-sync
> >  */
> > 
> > >>> I think it also means that the potentially multiple invocations
> > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > >>> a lock (another place where current's no_new_privs is evaluated).
> > >>
> > >> So no_new_privs can change from 0->1, but should not
> > >> when execve is running.
> > >>
> > >> As long as the calling thread is in execve it won't do this,
> > >> and the only other place, where it may set for other threads
> > >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > > 
> > > Yeah, everything was fine until I had to go complicate things with
> > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > > flag was used to control this, but it required that the filter not get
> > > applied during exec.
> > > 
> > >>> Related, it also means that cred_guard_mutex is unheld for every
> > >>> invocation of search_binary_handler() (which can loop via the previously
> > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> > >>> currently.)
> > >>>
> > >>> For seccomp, the expectations about existing thread states risks races
> > >>> too. There are two locks held for TSYNC:
> > >>> - current->sighand->siglock is held to keep new threads from
> > >>>   appearing/disappearing, which would destroy filter refcounting and
> > >>>   lead to memory corruption.
> > >>
> > >> I don't understand what you mean here.
> > >> How can this lead to memory corruption?
> > > 
> > > Mainly this is a matter of how seccomp manages its filter hierarchy
> > > (since the filters are shared through process ancestry), so if a thread
> > > appears in the middle of TSYNC it may be racing another TSYNC and break
> > > ancestry, leading to bad reference counting on process death, etc.
> > > (Though, yes, with refcount_t now, things should never corrupt, just
> > > waste memory.)
> > > 
> > 
> > I assume for now, that the current->sighand->siglock held while iterating all
> > threads is sufficient here.
> > 
> > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> > >>>   avoid no_new_privs and filter confusion during exec, which could
> > >>>   lead to exploitable setuid conditions (see below).
> > >>>
> > >>> Just racing a malicious thread during TSYNC is not a very strong
> > >>> example (a malicious thread could do lots of fun things to "current"
> > >>> before it ever got near calling TSYNC), but I think there is the risk
> > >>> of mismatched/confused states that we don't want to allow. One is a
> > >>> particularly bad state that could lead to privilege escalations (in the
> > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > >>> has a filter attached that silently fails a priv-dropping setuid call
> > >>> and continues execution with elevated privs, it can be tricked into
> > >>> doing bad things on behalf of the unprivileged parent, which was the
> > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > >>>
> > >>> thread A clones thread B
> > >>> thread B starts setuid exec
> > >>> thread A sets no_new_privs
> > >>> thread A calls seccomp with TSYNC
> > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > >>> thread B passes check_unsafe_exec() with no_new_privs unset
> > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > >>> thread B finishes exec, now running with elevated privs, a filter chosen
> > >>>          by thread A, _and_ nnp set (which doesn't matter)
> > >>>
> > >>> With the original locking, thread B will fail check_unsafe_exec()
> > >>> because filter and nnp state are changed together, with "atomicity"
> > >>> protected by the cred_guard_mutex.
> > >>>
> > >>
> > >> Ah, good point, thanks!
> > >>
> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> > >> while the cred_guard_mutex is locked, like this for instance:
> > >>
> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > >> index b6ea3dc..377abf0 100644
> > >> --- a/kernel/seccomp.c
> > >> +++ b/kernel/seccomp.c
> > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> > >>         BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> > >>         assert_spin_locked(&current->sighand->siglock);
> > >>  
> > >> +       if (current->signal->cred_locked_for_ptrace)
> > >> +               return -EAGAIN;
> > >> +
> > > 
> > > Hmm. I guess something like that could work. TSYNC expects to be able to
> > > report _which_ thread wrecked the call, though... I wonder if in_execve
> > > could be used to figure out the offending thread. Hm, nope, that would
> > > be outside of lock too (and all users are "current" right now, so the
> > > lock wasn't needed before).
> > > 
> > 
> > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> > but the caller will die quickly and cannot do anything with that information
> > when another thread executes execve, right?
> > 
> > >>         /* Validate all threads being eligible for synchronization. */
> > >>         caller = current;
> > >>         for_each_thread(caller, thread) {
> > >>
> > >>
> > >>> And this is just the bad state I _can_ see. I'm worried there are more...
> > >>>
> > >>> All this said, I do see a small similarity here to the work I did to
> > >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> > >>> decisions for the bprm based on current's state -- but current's state
> > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > >>> current's copy until exec ended and then stored bprm's copy into current.
> > >>> If the only problem anyone can see here is the handling of no_new_privs,
> > >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> > >>> from cred_guard_mutex.
> > >>>
> > >>
> > >> I still think that is solvable with using cred_locked_for_ptrace and
> > >> simply make the tsync fail if it would otherwise be blocked.
> > > 
> > > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > > 
> > 
> > Yeah, I'd go with "cred_locked_in_execve".
> > 
> > > And the comment on bool cred_locked_for_ptrace should mention that
> > > access is only allowed under cred_guard_mutex lock.
> > > 
> > 
> > okay.
> > 
> > >>>> +	sig->cred_locked_for_ptrace = false;
> > > 
> > > This is redundant to the zalloc -- I think you can drop it (unless
> > > someone wants to keep it for clarify?)
> > > 
> > 
> > I'll remove that here and in init/init_task.c
> > 
> > > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > > be sufficient to see a stable version of the thread...
> > > 
> > 
> > No, these need to be addressed individually, but most users just want
> > to know if the current credentials are sufficient at this moment, but will
> > not change the credentials, as ptrace and TSYNC do. 
> > 
> > BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> > So adding an access to cred_locked_for_execve in ptrace_may_access is
> > probably not an option.
> > 
> > However, one nice added value by this change is this:
> > 
> > void *thread(void *arg)
> > {
> > 	ptrace(PTRACE_TRACEME, 0,0,0);
> > 	return NULL;
> > }
> > 
> > int main(void)
> > {
> > 	int pid = fork();
> > 
> > 	if (!pid) {
> > 		pthread_t pt;
> > 		pthread_create(&pt, NULL, thread, NULL);
> > 		pthread_join(pt, NULL);
> > 		execlp("echo", "echo", "passed", NULL);
> > 	}
> > 
> > 	sleep(1000);
> > 	ptrace(PTRACE_ATTACH, pid, 0,0);
> > 	kill(pid, SIGCONT);
> > 	return 0;
> > }
> > 
> > cat /proc/3812/stack 
> > [<0>] flush_old_exec+0xbf/0x760
> > [<0>] load_elf_binary+0x35a/0x16c0
> > [<0>] search_binary_handler+0x97/0x1d0
> > [<0>] __do_execve_file.isra.40+0x624/0x920
> > [<0>] __x64_sys_execve+0x49/0x60
> > [<0>] do_syscall_64+0x64/0x220
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > 
> > > (I remain very nervous about weakening cred_guard_mutex without
> > > addressing the many many users...)
> > > 
> > 
> > They need to be looked at closely, that's pretty clear.
> > Most fall in the class, that just the current credentials need
> > to stay stable for a certain time.
> 
> I remain rather set on wanting some very basic tests with this change.
> Imho, looking through tools/testing/selftests again we don't have nearly
> enough for these codepaths; not to say none. Basically, if someone wants
> to make a change affecting the current problem we should really have at
> least a single simple test/reproducer that can be run without digging
> through lore. And hopefully over time we'll have more tests.

Which you added in v4. Which is great! (I should've mentioned this in my
first mail.)
Christian