From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E0E7EC3F2D7 for ; Wed, 4 Mar 2020 15:30:46 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 79EF022B48 for ; Wed, 4 Mar 2020 15:30:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 79EF022B48 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ubuntu.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C226A6B0003; Wed, 4 Mar 2020 10:30:45 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BD25E6B0005; Wed, 4 Mar 2020 10:30:45 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AC1516B0006; Wed, 4 Mar 2020 10:30:45 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 944A06B0003 for ; Wed, 4 Mar 2020 10:30:45 -0500 (EST) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 3A2A7381E for ; Wed, 4 Mar 2020 15:30:45 +0000 (UTC) X-FDA: 76558067250.19.field16_7aec3e815f322 X-HE-Tag: field16_7aec3e815f322 X-Filterd-Recvd-Size: 12204 Received: from youngberry.canonical.com (youngberry.canonical.com [91.189.89.112]) by imf40.hostedemail.com (Postfix) with ESMTP for ; Wed, 4 Mar 2020 15:30:44 +0000 (UTC) Received: from b2b-5-147-251-51.unitymedia.biz ([5.147.251.51] helo=wittgenstein) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1j9VyE-0002FV-Te; Wed, 04 Mar 2020 15:30:03 +0000 Date: Wed, 4 Mar 2020 16:30:02 +0100 From: Christian Brauner To: Bernd Edlinger Cc: Kees Cook , "Eric W. Biederman" , Jann Horn , Jonathan Corbet , Alexander Viro , Andrew Morton , Alexey Dobriyan , Thomas Gleixner , Oleg Nesterov , Frederic Weisbecker , Andrei Vagin , Ingo Molnar , "Peter Zijlstra (Intel)" , Yuyang Du , David Hildenbrand , Sebastian Andrzej Siewior , Anshuman Khandual , David Howells , James Morris , Greg Kroah-Hartman , Shakeel Butt , Jason Gunthorpe , Christian Kellner , Andrea Arcangeli , Aleksa Sarai , "Dmitry V. Levin" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-fsdevel@vger.kernel.org" , "linux-mm@kvack.org" , "stable@vger.kernel.org" Subject: Re: [PATCHv4] exec: Fix a deadlock in ptrace Message-ID: <20200304153002.ck77l6nifnvn647p@wittgenstein> References: <87k142lpfz.fsf@x220.int.ebiederm.org> <875zfmloir.fsf@x220.int.ebiederm.org> <87v9nmjulm.fsf@x220.int.ebiederm.org> <202003021531.C77EF10@keescook> <202003022103.196C313623@keescook> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote: > On 3/3/20 6:29 AM, Kees Cook wrote: > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote: > >> On 3/3/20 3:26 AM, Kees Cook wrote: > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote: > >>>> [...] > >>> > >>> If I'm reading this patch correctly, this changes the lifetime of the > >>> cred_guard_mutex lock to be: > >>> - during prepare_bprm_creds() > >>> - from flush_old_exec() through install_exec_creds() > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through > >>> install_exec_creds(). > > > > BTW, I think the effect of this change (i.e. my paragraph above) should > > be distinctly called out in the commit log if this solution moves > > forward. > > > > Okay, will do. > > >>> That means, for example, that check_unsafe_exec()'s documented invariant > >>> is violated: > >>> /* > >>> * determine how safe it is to execute the proposed program > >>> * - the caller must hold ->cred_guard_mutex to protect against > >>> * PTRACE_ATTACH or seccomp thread-sync > >>> */ > >> > >> Oh, right, I haven't understood that hint... > > > > I know no_new_privs is checked there, but I haven't studied the > > PTRACE_ATTACH part of that comment. If that is handled with the new > > check, this comment should be updated. > > > > Okay, I change that comment to: > > /* > * determine how safe it is to execute the proposed program > * - the caller must have set ->cred_locked_in_execve to protect against > * PTRACE_ATTACH or seccomp thread-sync > */ > > >>> I think it also means that the potentially multiple invocations > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without > >>> a lock (another place where current's no_new_privs is evaluated). > >> > >> So no_new_privs can change from 0->1, but should not > >> when execve is running. > >> > >> As long as the calling thread is in execve it won't do this, > >> and the only other place, where it may set for other threads > >> is in seccomp_sync_threads, but that can easily be avoided see below. > > > > Yeah, everything was fine until I had to go complicate things with > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while > > later gaining a seccomp filter from an unpriv process. The no_new_privs > > flag was used to control this, but it required that the filter not get > > applied during exec. > > > >>> Related, it also means that cred_guard_mutex is unheld for every > >>> invocation of search_binary_handler() (which can loop via the previously > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid() > >>> currently.) > >>> > >>> For seccomp, the expectations about existing thread states risks races > >>> too. There are two locks held for TSYNC: > >>> - current->sighand->siglock is held to keep new threads from > >>> appearing/disappearing, which would destroy filter refcounting and > >>> lead to memory corruption. > >> > >> I don't understand what you mean here. > >> How can this lead to memory corruption? > > > > Mainly this is a matter of how seccomp manages its filter hierarchy > > (since the filters are shared through process ancestry), so if a thread > > appears in the middle of TSYNC it may be racing another TSYNC and break > > ancestry, leading to bad reference counting on process death, etc. > > (Though, yes, with refcount_t now, things should never corrupt, just > > waste memory.) > > > > I assume for now, that the current->sighand->siglock held while iterating all > threads is sufficient here. > > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to > >>> avoid no_new_privs and filter confusion during exec, which could > >>> lead to exploitable setuid conditions (see below). > >>> > >>> Just racing a malicious thread during TSYNC is not a very strong > >>> example (a malicious thread could do lots of fun things to "current" > >>> before it ever got near calling TSYNC), but I think there is the risk > >>> of mismatched/confused states that we don't want to allow. One is a > >>> particularly bad state that could lead to privilege escalations (in the > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process > >>> has a filter attached that silently fails a priv-dropping setuid call > >>> and continues execution with elevated privs, it can be tricked into > >>> doing bad things on behalf of the unprivileged parent, which was the > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]): > >>> > >>> thread A clones thread B > >>> thread B starts setuid exec > >>> thread A sets no_new_privs > >>> thread A calls seccomp with TSYNC > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B > >>> thread B passes check_unsafe_exec() with no_new_privs unset > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B > >>> thread B finishes exec, now running with elevated privs, a filter chosen > >>> by thread A, _and_ nnp set (which doesn't matter) > >>> > >>> With the original locking, thread B will fail check_unsafe_exec() > >>> because filter and nnp state are changed together, with "atomicity" > >>> protected by the cred_guard_mutex. > >>> > >> > >> Ah, good point, thanks! > >> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace > >> while the cred_guard_mutex is locked, like this for instance: > >> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c > >> index b6ea3dc..377abf0 100644 > >> --- a/kernel/seccomp.c > >> +++ b/kernel/seccomp.c > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) > >> BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); > >> assert_spin_locked(¤t->sighand->siglock); > >> > >> + if (current->signal->cred_locked_for_ptrace) > >> + return -EAGAIN; > >> + > > > > Hmm. I guess something like that could work. TSYNC expects to be able to > > report _which_ thread wrecked the call, though... I wonder if in_execve > > could be used to figure out the offending thread. Hm, nope, that would > > be outside of lock too (and all users are "current" right now, so the > > lock wasn't needed before). > > > > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, > but the caller will die quickly and cannot do anything with that information > when another thread executes execve, right? > > >> /* Validate all threads being eligible for synchronization. */ > >> caller = current; > >> for_each_thread(caller, thread) { > >> > >> > >>> And this is just the bad state I _can_ see. I'm worried there are more... > >>> > >>> All this said, I do see a small similarity here to the work I did to > >>> stabilize stack rlimits (there was an ongoing problem with making multiple > >>> decisions for the bprm based on current's state -- but current's state > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored > >>> current's copy until exec ended and then stored bprm's copy into current. > >>> If the only problem anyone can see here is the handling of no_new_privs, > >>> we might be able to solve that similarly, at least disentangling tsync/nnp > >>> from cred_guard_mutex. > >>> > >> > >> I still think that is solvable with using cred_locked_for_ptrace and > >> simply make the tsync fail if it would otherwise be blocked. > > > > I wonder if we can find a better name than "cred_locked_for_ptrace"? > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something? > > > > Yeah, I'd go with "cred_locked_in_execve". > > > And the comment on bool cred_locked_for_ptrace should mention that > > access is only allowed under cred_guard_mutex lock. > > > > okay. > > >>>> + sig->cred_locked_for_ptrace = false; > > > > This is redundant to the zalloc -- I think you can drop it (unless > > someone wants to keep it for clarify?) > > > > I'll remove that here and in init/init_task.c > > > Also, I think cred_locked_for_ptrace needs checking deeper, in > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to > > be sufficient to see a stable version of the thread... > > > > No, these need to be addressed individually, but most users just want > to know if the current credentials are sufficient at this moment, but will > not change the credentials, as ptrace and TSYNC do. > > BTW: Not all users have cred_guard_mutex, see mm/migrate.c, > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. > So adding an access to cred_locked_for_execve in ptrace_may_access is > probably not an option. That could be solved by e.g. adding ptrace_may_access_{no}exec() taking cred_guard_mutex.