Date: Thu, 6 Aug 2020 01:44:01 -0400
From: "Michael S.
Tsirkin"
To: Nick Kralevich
Cc: Lokesh Gidra, Jeffrey Vander Stoep, Andrea Arcangeli,
    Suren Baghdasaryan, Kees Cook, Jonathan Corbet, Alexander Viro,
    Luis Chamberlain, Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
    Andy Shevchenko, Vlastimil Babka, Mel Gorman,
    Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport, Jerome Glisse,
    Shaohua Li, linux-doc@vger.kernel.org, LKML, Linux FS Devel,
    Tim Murray, Minchan Kim, Sandeep Patil, kernel@android.com,
    Daniel Colascione, Kalesh Singh
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
Message-ID: <20200806004351-mutt-send-email-mst@kernel.org>

On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin wrote:
> >
> > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra wrote:
> > > > From the discussion so far it seems that there is a consensus that
> > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > anything that is pending on that patch?
> > >
> > > That's my reading of this thread too.
> > >
> > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > would be more a question to the Android userland team.
> > > > > >
> > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > already applied to unprivileged software running without
> > > > > > CAP_SYS_PTRACE capability?
> > > > >
> > > > > Yes.
> > > > >
> > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > seccomp on a few processes, but we have found that it has a
> > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > on system wide is not a good option.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > >
> > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > not applied to all processes. It's too expensive and impractical when
> > > simpler implementations (such as this sysctl) can exist. It's also
> > > significantly simpler to test a sysctl value for correctness as
> > > opposed to a seccomp filter.
> >
> > Given that selinux is already used system-wide on Android, what is wrong
> > with using selinux to control userfaultfd as opposed to seccomp?
>
> Userfaultfd file descriptors will be generally controlled by SELinux.
> You can see the patchset at
> https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/
> (which is also referenced in the original commit message for this
> patchset). However, the SELinux patchset doesn't include the ability
> to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
>
> SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> which combined with this patch, is largely equivalent to direct
> UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> above, movement of userfaultfd file descriptors can be mediated by
> SELinux, preventing one process from acquiring userfaultfd descriptors
> of other processes unless allowed by security policy.
>
> It's an interesting question whether finer-grain SELinux support for
> controlling UFFD_USER_MODE_ONLY should be added. I can see some
> advantages to implementing this. However, we don't need to decide that
> now.
>
> Kernel security checks generally break down into DAC (discretionary
> access control) and MAC (mandatory access control) controls. Most
> kernel security features check via both of these mechanisms. Security
> attributes of the system should be settable without necessarily
> relying on an LSM such as SELinux. This patch follows the same basic
> model -- system wide control of a hardening feature is provided by the
> unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> SELinux support for this can also be implemented on top of the DAC
> controls.
>
> This DAC/MAC split has been successful in several other security
> features. For example, the ability to map at page zero is controlled
> in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> mmap_zero access vector [2]. Similarly, access to the kernel ring
> buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
> sysctl is very similar to this patch -- it introduces a capability
> (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
>
> If we want to ensure that a security feature will be well tested and
> vetted, it's important to not limit its use to LSMs only. This ensures
> that kernel and application developers will always be able to test the
> effects of a security feature, without relying on LSMs like SELinux.
> It also ensures that all distributions can enable this security
> mitigation should it be necessary for their unique environments,
> without introducing an SELinux dependency. And this patch does not
> preclude an SELinux implementation should it be necessary.
>
> Even if we decide to implement fine-grain SELinux controls on
> UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> an either/or choice between SELinux and this patch. Both are
> necessary.
>
> -- Nick
>
> [1] https://wiki.debian.org/mmap_min_addr
> [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree this is similar to dmesg access. Here is why: it
is pretty easy for admins to know whether they run something that needs
to access the kernel ring buffer. And if a tool developer is poking at
dmesg, they can tell admins "we need these permissions". But it seems
impossible for an admin to know whether a userfaultfd page, e.g. one
used with shared memory, is accessed from the kernel.

So I guess the question is: how does anyone not running Android know to
set this flag?

I got the feeling it's not really possible, and so for a single-user
feature like this a single API seems enough. Given a choice between a
knob an admin is supposed to set and an selinux policy written by
presumably knowledgeable OS vendors, I'd opt for the second option.

Hope this helps.

> > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > assembly. Otherwise it'd require more changes.
> > >
> > > It would be good to understand what these changes are.
> > >
> > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > if you set with option "2"?
> > >
> > > I can see your point if you think about the feature as a whole.
> > > However, distributions (such as Android) have specialized knowledge of
> > > their security environments, and may not want to support the typical
> > > usages of userfaultfd. For such distributions, providing a mechanism
> > > to prevent userfaultfd from being useful as an exploit primitive,
> > > while still allowing the very limited use of userfaultfd for userspace
> > > faults only, is desirable. Distributions shouldn't be forced into
> > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > needs may be more specialized, and this sysctl knob empowers
> > > distributions to make this choice for themselves.
> > >
> > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > nothing else that could break?
> > >
> > > This is a great example of a capability which a distribution may not
> > > want to support, due to distribution specific security policies.
> > >
> > > > > >
> > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > >
> > > Can you elaborate on what bug is present by processes being
> > > unprotected by seccomp?
> > >
> > > Seccomp cannot be universally applied on Android due to previously
> > > mentioned performance concerns. Seccomp is used in Android primarily
> > > as a tool to enforce the list of allowed syscalls, so that such
> > > syscalls can be audited before being included as part of the Android
> > > API.
> > >
> > > -- Nick
> > >
> > > --
> > > Nick Kralevich | nnk@google.com
> >
>
> --
> Nick Kralevich | nnk@google.com
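
For readers following the thread, the policy under discussion can be
modeled in a few lines of C. This is only a sketch and not the patch
itself: it assumes that the knob, when set, makes userfaultfd() fail
for callers without CAP_SYS_PTRACE unless they pass UFFD_USER_MODE_ONLY,
and it assumes the flag has the value 1. The names sketch_uffd_check,
sketch_uffd_selftest and SKETCH_UFFD_USER_MODE_ONLY are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed flag value, redefined locally so no kernel headers are needed. */
#define SKETCH_UFFD_USER_MODE_ONLY 1

/*
 * Model of the permission check discussed in the thread (an assumption,
 * not the kernel code): returns 0 if the call would be allowed and
 * -EPERM if it would be refused.
 */
int sketch_uffd_check(bool knob_user_mode_only, bool has_cap_sys_ptrace,
		      int flags)
{
	bool user_mode_only = (flags & SKETCH_UFFD_USER_MODE_ONLY) != 0;

	/* Only unprivileged callers asking for kernel-mode faults are refused. */
	if (knob_user_mode_only && !has_cap_sys_ptrace && !user_mode_only)
		return -EPERM;
	return 0;
}

/* Walks through the four interesting combinations. */
void sketch_uffd_selftest(void)
{
	/* Knob off: nothing changes for anyone. */
	assert(sketch_uffd_check(false, false, 0) == 0);
	/* Knob on, unprivileged, no flag: refused. */
	assert(sketch_uffd_check(true, false, 0) == -EPERM);
	/* Knob on, unprivileged, user-mode-only faults: allowed. */
	assert(sketch_uffd_check(true, false, SKETCH_UFFD_USER_MODE_ONLY) == 0);
	/* Knob on, CAP_SYS_PTRACE holder (e.g. CRIU): allowed. */
	assert(sketch_uffd_check(true, true, 0) == 0);
}
```

Under these assumptions the SELinux angle reduces to who is granted
CAP_SYS_PTRACE, which is the "largely equivalent" point made in the
quoted text.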