From: Alban Crequy
Date: Tue, 3 Nov 2020 15:10:35 +0100
Subject: Re: [PATCH 00/34] fs: idmapped mounts
References: <20201029003252.2128653-1-christian.brauner@ubuntu.com> <87pn51ghju.fsf@x220.int.ebiederm.org> <20201029155148.5odu4j2kt62ahcxq@yavin.dot.cyphar.com> <87361xdm4c.fsf@x220.int.ebiederm.org>
In-Reply-To: <87361xdm4c.fsf@x220.int.ebiederm.org>
To: "Eric W.
 Biederman"
Cc: Aleksa Sarai, Christian Brauner, Alexander Viro, Christoph Hellwig,
 linux-fsdevel, John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
 Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
 OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
 Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
 Tycho Andersen, David Howells, James Bottomley, Jann Horn, Seth Forshee,
 Stéphane Graber, Lennart Poettering, smbarber@chromium.org, Phil Estes,
 Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, Linux Containers,
 LSM, linux-api@vger.kernel.org, linux-ext4@vger.kernel.org,
 linux-unionfs@vger.kernel.org, linux-audit@redhat.com, linux-integrity,
 selinux@vger.kernel.org
X-Mailing-List: linux-unionfs@vger.kernel.org

On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman wrote:
>
> Aleksa Sarai writes:
>
> > On 2020-10-29, Eric W. Biederman wrote:
> >> Christian Brauner writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. Neither rlimits nor
> > any of the other proposals we discussed at LPC will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen. This is especially true in the context of
> facilities like io_uring, that play very aggressive games with how
> process context makes it to system calls.
>
> The only reason containers were not immediately exploitable when io_uring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem do not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not, what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs. We cannot
necessarily extract the tarball directly with the right uids because
we might use the same container image for privileged containers (with
the host userns) and unprivileged containers (with a new userns), so
we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs
helps to make the recursive chown faster, it can still take time with
large container images with lots of files. I’d like to use this patch
set to set up the root fs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even
if admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares.
This could take time on large filesystems and realistically, we want
to support rolling updates where some nodes use the previous version
without user namespaces and new nodes are progressively migrated to
the new userns with the new id mapping. If both sets of nodes use the
same NFS share, that can’t work.

Alban
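
A rough illustration of the two pod setups described above, not taken
from the patch set. It models an id mapping as uid_map-style extents
(ID inside the userns, ID outside, length) and checks whether two pods
share any host uid. The names IdMapExtent, map_id and overlaps, and all
uid ranges, are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IdMapExtent:
    """One uid_map-style extent: inside_start outside_start length."""
    inside_start: int
    outside_start: int
    length: int


def map_id(extents: List[IdMapExtent], inside_id: int) -> Optional[int]:
    """Translate an ID inside the userns to the host ID; None if unmapped."""
    for e in extents:
        if e.inside_start <= inside_id < e.inside_start + e.length:
            return e.outside_start + (inside_id - e.inside_start)
    return None


def overlaps(a: List[IdMapExtent], b: List[IdMapExtent]) -> bool:
    """True if the two mappings share at least one host ID."""
    return any(
        x.outside_start < y.outside_start + y.length
        and y.outside_start < x.outside_start + x.length
        for x in a
        for y in b
    )


# Hypothetical ranges, chosen only for illustration.
# Setup 1: every pod has its own userns but the same mapping.
pod_same_1 = [IdMapExtent(0, 100000, 65536)]
pod_same_2 = [IdMapExtent(0, 100000, 65536)]

# Setup 2: every pod has its own userns with a non-overlapping mapping.
pod_a = [IdMapExtent(0, 100000, 65536)]
pod_b = [IdMapExtent(0, 165536, 65536)]
```

In setup 1, root in either pod resolves to the same host uid, so one
chowned copy of a tree fits both pods. In setup 2 the pods never share
a host uid, so a tree shared between them needs either one chown per
mapping or a shift applied at mount time.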