Linux-Fsdevel Archive on lore.kernel.org
From: Christian Brauner <christian.brauner@ubuntu.com>
To: "Stéphane Graber" <stgraber@ubuntu.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	"Aleksa Sarai" <cyphar@cyphar.com>,
	"Jann Horn" <jannh@google.com>
Cc: smbarber@chromium.org, Seth Forshee <seth.forshee@canonical.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Serge Hallyn <serge@hallyn.com>, James Morris <jmorris@namei.org>,
	Kees Cook <keescook@chromium.org>,
	Jonathan Corbet <corbet@lwn.net>, Phil Estes <estesp@gmail.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	containers@lists.linux-foundation.org,
	linux-security-module@vger.kernel.org, linux-api@vger.kernel.org,
	Christian Brauner <christian.brauner@ubuntu.com>
Subject: [PATCH v2 00/28] user_namespace: introduce fsid mappings
Date: Fri, 14 Feb 2020 19:35:26 +0100
Message-ID: <20200214183554.1133805-1-christian.brauner@ubuntu.com>

Hey everyone,

This is v2 with various fixes after discussions with Jann.

From pings, off-list questions, and discussions at the Google Container
Security Summit there seems to be quite a lot of interest in this
patchset, with use-cases ranging from layer sharing for app containers
and k8s to data sharing between containers with different id
mappings. I haven't Cc'ed everyone because I don't have all the email
addresses at hand but I've at least added Phil now. :)

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the containers microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
we all were quite busy with other work and couldn't really sit down and
implement it. But I took a few days last week to do this work, including
demos and performance testing.
This implementation does not require us to touch the vfs substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case for shiftfs for us is in allowing shared writable
storage to multiple containers using non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will look up fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.
The whole feature set is placed under a config option that defaults to
false.

In the usual case of running an unprivileged container we will have
set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works because whenever the kernel needs to do a filesystem
access it will look up the corresponding uid and gid in the id mapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000, which is needed e.g. to
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to look up a mapping for 300000 in the id mapping
tables of the user namespace, but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.
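
To make the failure concrete, here is a small userspace model of the
lookup described above. It is only a sketch and not the kernel code: the
names map_id_up() and owner_as_seen() are made up for illustration, and
65534 is assumed as the overflow id.

```python
OVERFLOW_ID = 65534  # assumed default overflow id (nobody/nogroup)


def map_id_up(extents, kernel_id):
    """Translate a kernel (on-disk) id into the namespace, or None if unmapped.

    Each extent is (first, lower_first, count), the same three numbers as a
    line in an id map: ids first..first+count-1 inside the namespace
    correspond to lower_first..lower_first+count-1 outside.
    """
    for first, lower_first, count in extents:
        if lower_first <= kernel_id < lower_first + count:
            return first + (kernel_id - lower_first)
    return None


def owner_as_seen(extents, kernel_id):
    """Owner reported inside the namespace: the mapped id or the overflow id."""
    mapped = map_id_up(extents, kernel_id)
    return OVERFLOW_ID if mapped is None else mapped


id_map = [(0, 100000, 100000)]        # id mapping "0 100000 100000"

print(owner_as_seen(id_map, 100000))  # file chowned to 100000 -> 0
print(owner_as_seen(id_map, 300000))  # file owned by 300000   -> 65534
```

The second lookup finds no extent covering 300000, which is why the
files show up as owned by the overflow id.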

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now look up the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.
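
The shared-storage case can be sketched the same way. Again this is a
userspace model under stated assumptions, not the implementation: the
UserNs class and both helpers are hypothetical, and the fallback to the
id mapping when no fsid mapping was written is an assumption of the
model. Two containers with non-overlapping id mappings share one fsid
mapping:

```python
def map_id_down(extents, ns_id):
    """Translate an id inside the namespace into a kernel id, or None."""
    for first, lower_first, count in extents:
        if first <= ns_id < first + count:
            return lower_first + (ns_id - first)
    return None


def map_id_up(extents, kernel_id):
    """Translate a kernel id into the namespace, or None if unmapped."""
    for first, lower_first, count in extents:
        if lower_first <= kernel_id < lower_first + count:
            return first + (kernel_id - lower_first)
    return None


class UserNs:
    """Hypothetical stand-in for a user namespace with two mapping tables."""

    def __init__(self, id_map, fsid_map=None):
        self.id_map = id_map
        # Assumption of this model: without a separate fsid mapping,
        # fsids are looked up in the id mapping.
        self.fsid_map = fsid_map if fsid_map is not None else id_map


shared = [(0, 300000, 100000)]                       # fsid mapping
c1 = UserNs(id_map=[(0, 100000, 100000)], fsid_map=shared)
c2 = UserNs(id_map=[(0, 200000, 100000)], fsid_map=shared)

on_disk = map_id_down(c1.fsid_map, 0)  # root in c1 creates a file
print(on_disk)                         # 300000
print(map_id_up(c2.fsid_map, on_disk)) # 0, i.e. root-owned in c2 as well
print(map_id_down(c1.id_map, 0))       # 100000, c1's root for everything else
```

The point of the split is visible in the last line: process-related
ids still go through the per-container id mapping, only filesystem ids
use the shared table.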

A note on proc (and sys): the proc filesystem is special insofar as it
only has a single superblock that is (currently, though this might be
about to change) visible in all user namespaces (same goes for sys). This
means it has special semantics in many ways, including how file ownership
and access work. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to look up
filesystem access in id mapping tables.

When writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit for fsid mappings is
the same as for id mappings, i.e. 340 lines.
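
For readers who don't have the rules in their head, the following sketch
checks a mapping before it is written. It models only a subset of the
rules documented for uid_map/gid_map in user_namespaces(7) (three numbers
per line, non-zero count, no overlapping ranges, at most 340 lines); the
privilege checks on who may write a map are not modelled, and
parse_id_map() is a made-up name.

```python
MAX_MAP_LINES = 340  # limit stated above, same as for uid_map/gid_map


def parse_id_map(text):
    """Parse "first lower_first count" lines and apply basic sanity checks."""
    extents = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected three numbers, got: {line!r}")
        first, lower_first, count = (int(f) for f in fields)
        if count <= 0:
            raise ValueError("count must be greater than zero")
        extents.append((first, lower_first, count))
    if not extents:
        raise ValueError("a map needs at least one line")
    if len(extents) > MAX_MAP_LINES:
        raise ValueError(f"more than {MAX_MAP_LINES} lines")
    # Ranges may not overlap on either side of the mapping:
    # side 0 is the range inside the namespace, side 1 the range outside.
    for side in (0, 1):
        ranges = sorted((e[side], e[side] + e[2]) for e in extents)
        for (_, prev_end), (start, _) in zip(ranges, ranges[1:]):
            if start < prev_end:
                raise ValueError("overlapping ranges")
    return extents


print(parse_id_map("0 300000 100000\n"))
```

A validated mapping would then be written in a single write to the
fsuid_map or fsgid_map file of the target process, just as with
uid_map and gid_map.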

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files, fstat()ing them, and calculating the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
             |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
             |              |                     | and unset for all  | and set for all    |
             |              |                     | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |
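
The measurement loop itself is simple. The sketch below is not the
harness used for the numbers above; it only illustrates the
create/fstat()/average procedure. It leaves out the user namespace setup
with N mappings entirely, and the interpreter overhead means its
absolute numbers are not comparable to the table. mean_fstat_ns() is a
made-up name.

```python
import os
import tempfile
import time


def mean_fstat_ns(num_files):
    """Create num_files empty files, fstat() each, return mean ns per call."""
    total_ns = 0
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = []
        for i in range(num_files):
            path = os.path.join(tmpdir, f"f{i}")
            with open(path, "w"):
                pass
            paths.append(path)
        for path in paths:
            # Open one file at a time to stay below the fd limit.
            fd = os.open(path, os.O_RDONLY)
            try:
                start = time.perf_counter_ns()
                os.fstat(fd)
                total_ns += time.perf_counter_ns() - start
            finally:
                os.close(fd)
    return total_ns / num_files


print(f"mean fstat(): {mean_fstat_ns(1000):.0f} ns")
```

To compare kernels in a meaningful way the same loop would have to run
inside a user namespace with the given number of mappings, on an
otherwise idle machine.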

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (28):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  proc: task_state(): use from_kfs{g,u}id_munged
  cred: add kfs{g,u}id
  sys: __sys_setfsuid(): handle fsid mappings
  sys: __sys_setfsgid(): handle fsid mappings
  sys:__sys_setuid(): handle fsid mappings
  sys:__sys_setgid(): handle fsid mappings
  sys:__sys_setreuid(): handle fsid mappings
  sys:__sys_setregid(): handle fsid mappings
  sys:__sys_setresuid(): handle fsid mappings
  sys:__sys_setresgid(): handle fsid mappings
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_bprm_set_creds(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap: handle fsid mappings with vfs caps
  exec: bprm_fill_uid(): handle fsid mappings
  ptrace: adapt ptrace_may_access() to always uses unmapped fsids
  devpts: handle fsid mappings

 fs/attr.c                      |  23 ++-
 fs/devpts/inode.c              |   7 +-
 fs/exec.c                      |  25 ++-
 fs/inode.c                     |   7 +-
 fs/namei.c                     |  36 +++-
 fs/open.c                      |  16 +-
 fs/posix_acl.c                 |  21 +--
 fs/proc/array.c                |   5 +-
 fs/proc/base.c                 |  34 ++++
 fs/stat.c                      |  48 ++++--
 include/linux/cred.h           |   4 +
 include/linux/fs.h             |   5 +
 include/linux/fsuidgid.h       | 122 +++++++++++++
 include/linux/stat.h           |   1 +
 include/linux/user_namespace.h |  10 ++
 init/Kconfig                   |  11 ++
 kernel/capability.c            |  10 +-
 kernel/ptrace.c                |   4 +-
 kernel/sys.c                   | 106 +++++++++---
 kernel/user.c                  |  22 +++
 kernel/user_namespace.c        | 303 ++++++++++++++++++++++++++++++++-
 security/commoncap.c           |  35 ++--
 22 files changed, 757 insertions(+), 98 deletions(-)
 create mode 100644 include/linux/fsuidgid.h


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
-- 
2.25.0


Thread overview: 43+ messages
2020-02-14 18:35 Christian Brauner [this message]
2020-02-14 18:35 ` [PATCH v2 01/28] user_namespace: introduce fsid mappings infrastructure Christian Brauner
2020-02-14 18:35 ` [PATCH v2 02/28] proc: add /proc/<pid>/fsuid_map Christian Brauner
2020-02-14 18:35 ` [PATCH v2 03/28] proc: add /proc/<pid>/fsgid_map Christian Brauner
2020-02-14 18:35 ` [PATCH v2 04/28] fsuidgid: add fsid mapping helpers Christian Brauner
2020-02-14 19:11   ` Jann Horn
2020-02-16 16:55     ` Christian Brauner
2020-02-14 18:35 ` [PATCH v2 05/28] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
2020-02-14 18:35 ` [PATCH v2 06/28] cred: add kfs{g,u}id Christian Brauner
2020-02-14 18:35 ` [PATCH v2 07/28] sys: __sys_setfsuid(): handle fsid mappings Christian Brauner
2020-02-14 18:35 ` [PATCH v2 08/28] sys: __sys_setfsgid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 09/28] sys:__sys_setuid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 10/28] sys:__sys_setgid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 11/28] sys:__sys_setreuid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 12/28] sys:__sys_setregid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 13/28] sys:__sys_setresuid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 14/28] sys:__sys_setresgid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 15/28] fs: add is_userns_visible() helper Christian Brauner
2020-02-14 18:35 ` [PATCH v2 16/28] namei: may_{o_}create(): handle fsid mappings Christian Brauner
2020-02-14 18:35 ` [PATCH v2 17/28] inode: inode_owner_or_capable(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 18/28] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 19/28] stat: " Christian Brauner
2020-02-14 19:03   ` Tycho Andersen
2020-02-16 14:12     ` Christian Brauner
2020-02-14 18:35 ` [PATCH v2 20/28] open: " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 21/28] posix_acl: " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 22/28] attr: notify_change(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 23/28] commoncap: cap_bprm_set_creds(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 24/28] commoncap: cap_task_fix_setuid(): " Christian Brauner
2020-02-14 18:35 ` [PATCH v2 25/28] commoncap: handle fsid mappings with vfs caps Christian Brauner
2020-02-14 18:35 ` [PATCH v2 26/28] exec: bprm_fill_uid(): handle fsid mappings Christian Brauner
2020-02-14 18:35 ` [PATCH v2 27/28] ptrace: adapt ptrace_may_access() to always uses unmapped fsids Christian Brauner
2020-02-14 18:35 ` [PATCH v2 28/28] devpts: handle fsid mappings Christian Brauner
2020-02-16 15:55 ` [PATCH v2 00/28] user_namespace: introduce " Florian Weimer
2020-02-16 16:40   ` Christian Brauner
2020-02-17 21:06 ` James Bottomley
2020-02-17 21:20   ` Christian Brauner
2020-02-17 22:35     ` James Bottomley
2020-02-17 23:05       ` Christian Brauner
2020-02-17 21:11 ` James Bottomley
     [not found]   ` <CA+enf=vwd-dxzve87t7Mw1Z35RZqdLzVaKq=fZ4EGOpnES0f5w@mail.gmail.com>
2020-02-17 22:02     ` Stéphane Graber
2020-02-17 23:03     ` James Bottomley
2020-02-17 23:11       ` Stéphane Graber
