From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BF648C34022 for ; Mon, 17 Feb 2020 22:03:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 8F2FB20801 for ; Mon, 17 Feb 2020 22:03:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726327AbgBQWDI convert rfc822-to-8bit (ORCPT ); Mon, 17 Feb 2020 17:03:08 -0500 Received: from youngberry.canonical.com ([91.189.89.112]:55849 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726318AbgBQWDH (ORCPT ); Mon, 17 Feb 2020 17:03:07 -0500 Received: from mail-lf1-f44.google.com ([209.85.167.44]) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1j3oTp-0006Xg-IM for linux-fsdevel@vger.kernel.org; Mon, 17 Feb 2020 22:03:05 +0000 Received: by mail-lf1-f44.google.com with SMTP id n25so12976314lfl.0 for ; Mon, 17 Feb 2020 14:03:05 -0800 (PST) X-Gm-Message-State: APjAAAU0N7qbH0cw2U5PWTZ6BStShyW7t91sIyeHRFIdYOThl4hVbRUj FeFJhMhyjKGG49sJZUZyNwrFJS/ei9zh/+7sMMvKVA== X-Google-Smtp-Source: APXvYqxfzaNIrWsnxmQ/xqv/hlxasa72H1Nm3nfqgJQmPhBMHU82ojwvN6/kjkL7pS2rQUVLpiWQ8R4IEbGQPJ6GeNk= X-Received: by 2002:ac2:47e6:: with SMTP id b6mr8894012lfp.96.1581976984864; Mon, 17 Feb 2020 14:03:04 -0800 (PST) MIME-Version: 1.0 References: <20200214183554.1133805-1-christian.brauner@ubuntu.com> <1581973919.24289.12.camel@HansenPartnership.com> In-Reply-To: From: =?UTF-8?Q?St=C3=A9phane_Graber?= Date: Mon, 17 Feb 2020 17:02:52 -0500 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH v2 00/28] user_namespace: introduce fsid mappings To: James Bottomley Cc: Christian Brauner , "Eric W. Biederman" , Aleksa Sarai , Jann Horn , Kees Cook , Jonathan Corbet , linux-kernel@vger.kernel.org, Linux Containers , smbarber@chromium.org, Seth Forshee , linux-security-module@vger.kernel.org, Alexander Viro , linux-api@vger.kernel.org, linux-fsdevel , Alexey Dobriyan Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8BIT Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org And re-sending, this time hopefully actually in plain text mode. Sorry about that, my e-mail client isn't behaving today... Stéphane On Mon, Feb 17, 2020 at 4:57 PM Stéphane Graber wrote: > > On Mon, Feb 17, 2020 at 4:12 PM James Bottomley wrote: >> >> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote: >> [...] >> > With this patch series we simply introduce the ability to create fsid >> > mappings that are different from the id mappings of a user namespace. >> > The whole feature set is placed under a config option that defaults >> > to false. >> > >> > In the usual case of running an unprivileged container we will have >> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will >> > correspond to this id mapping, i.e. all files which we want to appear >> > as 0:0 inside the user namespace will be chowned to 100000:100000 on >> > the host. This works, because whenever the kernel needs to do a >> > filesystem access it will lookup the corresponding uid and gid in the >> > idmapping tables of the container. >> > Now think about the case where we want to have an id mapping of 0 >> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is >> > needed to e.g. share a single on-disk mapping with multiple >> > containers that all have different id mappings. >> > This will be problematic. Whenever a filesystem access is requested, >> > the kernel will now try to lookup a mapping for 300000 in the id >> > mapping tables of the user namespace but since there is none the >> > files will appear to be owned by the overflow id, i.e. usually >> > 65534:65534 or nobody:nogroup. >> > >> > With fsid mappings we can solve this by writing an id mapping of 0 >> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem >> > access the kernel will now lookup the mapping for 300000 in the fsid >> > mapping tables of the user namespace. And since such a mapping >> > exists, the corresponding files will have correct ownership. >> >> How do we parametrise this new fsid shift for the unprivileged use >> case? For newuidmap/newgidmap, it's easy because each user gets a >> dedicated range and everything "just works (tm)". However, for the >> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool >> has to know not only your allocated uid/gid chunk, but also the offset >> map of the image. The former is easy, but the latter is going to vary >> by the actual image ... well unless we standardise some accepted shift >> for images and it simply becomes a known static offset. > > > For unprivileged runtimes, I would expect images to be unshifted and be > unpacked from within a userns. So your unprivileged user would be allowed > a uid/gid range through /etc/subuid and /etc/subgid and allowed to use > them through newuidmap/newgidmap.In that namespace, you can then pull > and unpack any images/layers you may want and the resulting fs tree will > look correct from within that namespace. > > All that is possible today and is how for example unprivileged LXC works > right now. > > What this patchset then allows is for containers to have differing > uid/gid maps while still being based off the same image or layers. > In this scenario, you would carve a subset of your main uid/gid map for > each container you run and run them in a child user namespace while > setting up a fsuid/fsgid map such that their filesystem access do not > follow their uid/gid map. This then results in proper isolation for > processes, networks, ... as everything runs as different kuid/kgid but > the VFS view will be the same in all containers. > > Shared storage between those otherwise isolated containers would also > work just fine by simply bind-mounting the same path into two or more > containers. > > > Now one additional thing that would be safe for a setuid wrapper to > allow would be for arbitrary mapping of any of the uid/gid that the user > owns to be used within the fsuid/fsgid map. One potential use for this > would be to create any number of user namespaces, each with their own > mapping for uid 0 while still having all VFS access be mapped to the > user that spawned them (say uid=1000, gid=1000). > > > Note that in our case, the intended use for this is from a privileged runtime > where our images would be unshifted as would be the container storage > and any shared storage for containers. The security model effectively relying > on properly configured filesystem permissions and mount namespaces such > that the content of those paths can never be seen by anyone but root outside > of those containers (and therefore avoids all the issues around setuid/setgid/fscaps). > > We will then be able to allocate distinct, random, ranges of 65536 uids/gids (or more) > for each container without ever having to do any uid/gid shifting at the filesystem layer > or run into issues when having to setup shared storage between containers or attaching > external storage volumes to those containers. > >> James > > > Stéphane